[PATCH] dmr/amdgpu: Fix wrongly unref of BO

Wed Apr 19 11:37:14 UTC 2017

Hi Christian,

Without correctly kunmap, page table is corrupted. Page entries point to wrong memory locations. You might call it completely harmless. But I think this is a severe bug. Leaking memory is better than a corrupted page table. Think security.

Would you provide any document and reference by saying" It is impossible to receive a signal during module load/unload"? For example, if the unload stuck in a lock, can CTRL+C stop the unload?

If "It is impossible to receive a signal during module load/unload", interruptible waiting is fine too, because function amdgpu_bo_reserve will return successfully.

What about there is some other return error? What about in future somebody improve amdgpu_bo_reserve to return other errors, then function amdgpu_vram_scratch_fini becomes buggy?

While I am thinking whether there is a better way for the current situation, would you give a real world example that my patch really not working? Then we can address it.

Thanks,

Alex Bin

________________________________
From: Christian König <deathsimple at vodafone.de>
Sent: Wednesday, April 19, 2017 2:35 AM
To: Xie, AlexBin; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi AlexBin,

the answer is ttm_bo_kunmap isn't called at all and yes in the case of an iomap we leak the address space reserved.

But that is completely harmless on a 64bit system compared to leaking the memory backing the address space.

Using amdgpu_bo_free_kernel() instead of openly coding it here is probably a good idea.

Additional to that it's probably a good idea to set the no_intr flag when reserving kernel BOs. It is impossible to receive a signal during module load/unload, but it's probably better to document that in the code as well.

Regards,
Christian.

Am 18.04.2017 um 20:54 schrieb Xie, AlexBin:
Hi Christian,

Have you found how/where/when? When you said "mapping will just be released a bit later on", you must know the answer.

It is difficult to prove something does not exist. Anyway, I will give it a try to prove such "later on" does not exist.

Function ttm_bo_kunmap is the only function to unmap. To prove this, search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to correctly kunmap.

Function ttm_bo_kunmap is not called by ttm itself. This is a hint that all TTM delay delete mechanism or unref mechanism will NOT kunmap BO later on.

Function ttm_bo_kunmap is called by AMDGPU function amdgpu_bo_kunmap and amdgpu_gem_prime_vunmap.

Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for scratch VRAM BO.  amdgpu_bo_free_kernel is a suspect but the answer is still NO.

So all possibilities are searched. Did I miss anything?

Thanks,
Alex Bin Xie

________________________________
From: Xie, AlexBin
Sent: Tuesday, April 18, 2017 2:04:33 PM
To: Christian König; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi Christian,

Would you point out where/when will kunmap happen for this BO when release? It must be somewhere in some function calls.

I checked before I asked for review. But I did not see such obvious kunmap function call.

If so, there should be a comment in function amdgpu_vram_scratch_fini to avoid future confusion.

Thanks,
Alex Bin Xie
________________________________
From: Christian König <deathsimple at vodafone.de><mailto:deathsimple at vodafone.de>
Sent: Tuesday, April 18, 2017 1:46 PM
To: Xie, AlexBin; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi AlexBin,

No, David is right. This is a very common coding pattern in the kernel module.

Freeing up a BO while there still exists a kernel mapping is perfectly ok, the mapping will just be released a bit later on.

So this code is actually perfectly ok and just an optimization, but your patch breaks it and creates a memory leak.

Regards,
Christian.

Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:

Hi David,

When amdgpu_bo_reserve return errors, we cannot release the BO. This is not a memory leak. General speaking, memory leak is unnoticed and unintentional.

The caller of function amdgpu_vram_scratch_fini ignores the return error value...

The "memory leak" is not caused by my patch. It is caused because reserving BO fails.

This patch only aim to make function amdgpu_vram_scratch_fini behave correctly.

To follow up, we can add a warning message when amdgpu_bo_reserve error happens in a different patch.

If function call amdgpu_bo_reserve is changed to uninterruptible, this changes driver behaviour. Without a substantial issue, I would be cautious for such a change.

Thanks,

Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Monday, April 17, 2017 10:38 PM
To: Xie, AlexBin; amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

On 2017年04月17日 22:54, Xie, AlexBin wrote:

Hi David,

Thanks for the comments. However, please have look at amdgpu_bo_reserve definition.

static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)

Ah, this is a wired wrapper for ttm_bo_reserve.

When we call this function like the following:

         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
The false means interruptible.

On the other hand,  when amdgpu_bo_reserve function return error, why do we unref BO without kunmap and unpin the BO? This is wrong implementation when amdgpu_bo_reserve return any error.

Yeah, I see your mean, it's in driver un-loading, How about changing to no interruptible? Your patch will make a memleak if bo_reserve fails, but it seems not matter. I have no strong preference.

Regards,
David Zhou

Thanks,
Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Friday, April 14, 2017 12:00 AM
To: Xie, AlexBin; amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

On 2017年04月14日 05:34, Alex Xie wrote:
> According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
> can return with -ERESTARTSYS. When this function was interrupted
> by a signal, BO should not be unref. Otherwise the BO might be
> released while is kmapped and pinned, or BO MIGHT be deref
> multiple times, etc.
         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
we have specified interruptible to false, so -ERESTARTSYS isn't possible
here.

Thanks,
David Zhou
>
> Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
> Signed-off-by: Alex Xie <AlexBin.Xie at amd.com><mailto:AlexBin.Xie at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 53996e3..1dcc2d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct amdgpu_device *adev)
>                amdgpu_bo_kunmap(adev->vram_scratch.robj);
>                amdgpu_bo_unpin(adev->vram_scratch.robj);
>                amdgpu_bo_unreserve(adev->vram_scratch.robj);
> +             amdgpu_bo_unref(&adev->vram_scratch.robj);
>        }
> -     amdgpu_bo_unref(&adev->vram_scratch.robj);
>   }
>
>   /**

_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20170419/8a25492f/attachment-0001.html>