[PATCH] drm/ttm: fix ttm tt init fail when size exceeds kmalloc limit

Christian König christian.koenig at amd.com
Wed Apr 20 12:42:17 UTC 2022


Hi Kevin,

yes and that is perfectly valid and expected behavior. There is 
absolutely no need to change anything in TTM here.

What we could do is:
1) Change the test case to use something larger than 1TiB.
2) Change kvmalloc to allow GFP_ZERO allocations even in the vmalloc 
fallback path.

Regards,
Christian.

Am 20.04.22 um 14:39 schrieb Wang, Yang(Kevin):
>
> [AMD Official Use Only]
>
>
> Hi Chirs,
>
> yes, right, the amdgpu drive rwill use amdgpu_bo_validate_size() 
> function to verify bo size,
> but when driver try to allocate VRAM domain bo fail, the amdgpu driver 
> will fall back to allocate domain = (GTT | VRAM)  bo.
> please check following code, it will cause the 2nd time to allocate bo 
> fail during allocate 256Mb buffer to store dma address (via kvmalloc()).
>
>   initial_domain = (u32)(0xffffffff & args->in.domains);
> retry:
>         r = amdgpu_gem_object_create(adev, size, args->in.alignment,
>                  initial_domain,
>                  flags, ttm_bo_type_device, resv, &gobj);
>         if (r && r != -ERESTARTSYS) {
>                 if (flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) {
>     flags &= ~AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;
>     goto retry;
>                 }
>
>                 if (initial_domain == AMDGPU_GEM_DOMAIN_VRAM) {
>     initial_domain |= AMDGPU_GEM_DOMAIN_GTT;
>     goto retry;
>                 }
> DRM_DEBUG("Failed to allocate GEM object (%llu, %d, %llu, %d)\n",
>             size, initial_domain, args->in.alignment, r);
>   }
>
> Best Regards,
> Kevin
>
> ------------------------------------------------------------------------
> *From:* Christian König <ckoenig.leichtzumerken at gmail.com>
> *Sent:* Wednesday, April 20, 2022 7:55 PM
> *To:* Wang, Yang(Kevin) <KevinYang.Wang at amd.com>; Koenig, Christian 
> <Christian.Koenig at amd.com>; dri-devel at lists.freedesktop.org 
> <dri-devel at lists.freedesktop.org>; amd-gfx at lists.freedesktop.org 
> <amd-gfx at lists.freedesktop.org>
> *Subject:* Re: [PATCH] drm/ttm: fix ttm tt init fail when size exceeds 
> kmalloc limit
> Hi Kevin,
>
> no, the test case should already fail in amdgpu_bo_validate_size().
>
> If we have a system with 2TiB of memory where the test case could 
> succeed then we should increase the requested size to something larger.
>
> And if the underlying core Linux kernel functions don't allow 
> allocations as large as the requested one we should *NEVER* ever add 
> workarounds like this.
>
> It is perfectly expected that this test case is not able to fulfill 
> the desired allocation. That it fails during kvmalloc is unfortunate, 
> but not a show stopper.
>
> Regards,
> Christian.
>
>
> Am 20.04.22 um 13:32 schrieb Wang, Yang(Kevin):
>>
>> [AMD Official Use Only]
>>
>>
>> Hi Chris,
>>
>> you misunderstood background about this case.
>>
>> although we expect this test case to fail, it should fail at the 
>> location where the Bo actual memory is actually allocated. now the 
>> code logic will cause the failure to allocate memory to store DMA 
>> address.
>>
>> e.g: the case is failed in 2TB system ram machine, it should be 
>> allocated successful, but it is failed.
>>
>> allocate 1TB BO, the ttm should allocate 1TB/4k * 8 buffer to store 
>> allocate result (page address), this should not be failed usually.
>>
>> There is a similar fix in upstream kernel 5.18, before this fix 
>> entered the TTM code, this problem existed in TTM.
>>
>> kernel/git/torvalds/linux.git - Linux kernel source tree 
>> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fcommit%2F%3Fh%3Dv5.18-rc3%26id%3Da421ef303008b0ceee2cfc625c3246fa7654b0ca&data=05%7C01%7CKevinYang.Wang%40amd.com%7C2e9fd86a27d64a39432508da22c4b1f1%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637860525454702938%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=CiP9x3grwQ2aXFZPjpsAtwLCpBVeJufbGngy%2BtXLFbM%3D&reserved=0>
>> mm: allow !GFP_KERNEL allocations for kvmalloc
>>
>> Best Regards,
>> Kevin
>>
>> ------------------------------------------------------------------------
>> *From:* Koenig, Christian <Christian.Koenig at amd.com> 
>> <mailto:Christian.Koenig at amd.com>
>> *Sent:* Wednesday, April 20, 2022 6:53 PM
>> *To:* Wang, Yang(Kevin) <KevinYang.Wang at amd.com> 
>> <mailto:KevinYang.Wang at amd.com>; dri-devel at lists.freedesktop.org 
>> <mailto:dri-devel at lists.freedesktop.org> 
>> <dri-devel at lists.freedesktop.org> 
>> <mailto:dri-devel at lists.freedesktop.org>; 
>> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org> 
>> <amd-gfx at lists.freedesktop.org> <mailto:amd-gfx at lists.freedesktop.org>
>> *Subject:* Re: [PATCH] drm/ttm: fix ttm tt init fail when size 
>> exceeds kmalloc limit
>> Am 20.04.22 um 11:07 schrieb Wang, Yang(Kevin):
>>>
>>> [AMD Official Use Only]
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Koenig, Christian <Christian.Koenig at amd.com> 
>>> <mailto:Christian.Koenig at amd.com>
>>> *Sent:* Wednesday, April 20, 2022 5:00 PM
>>> *To:* Wang, Yang(Kevin) <KevinYang.Wang at amd.com> 
>>> <mailto:KevinYang.Wang at amd.com>; dri-devel at lists.freedesktop.org 
>>> <mailto:dri-devel at lists.freedesktop.org> 
>>> <dri-devel at lists.freedesktop.org> 
>>> <mailto:dri-devel at lists.freedesktop.org>; 
>>> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org> 
>>> <amd-gfx at lists.freedesktop.org> <mailto:amd-gfx at lists.freedesktop.org>
>>> *Subject:* Re: [PATCH] drm/ttm: fix ttm tt init fail when size 
>>> exceeds kmalloc limit
>>> Am 20.04.22 um 10:56 schrieb Yang Wang:
>>> > if the __GFP_ZERO is set, the kvmalloc() can't fallback to use 
>>> vmalloc()
>>>
>>> Hui what? Why should kvmalloc() not be able to fallback to vmalloc()
>>> when __GFP_ZERO is set?
>>>
>>> And even that is really the case then that sounds like a bug in 
>>> kvmalloc().
>>>
>>> Regards,
>>> Christian.
>>>
>>> [kevin]:
>>> it is really test case from libdrm amdgpu test, which try to 
>>> allocate a big BO which will cause ttm tt init fail.
>>
>>
>> LOL! Guys, this test case is intended to fail!
>> *
>> *The test consists of allocating a buffer so ridiculous large that it 
>> should never succeed and be rejected by the kernel driver.
>>
>> This patch here is a really clear NAK.
>>
>> Regards,
>> Christian.
>>
>>> it may be a kvmalloc() bug, and this patch can as a workaround 
>>> in ttm before this issue fix.
>>>
>>> void *kvmalloc_node(size_t size, gfp_t flags, int node)
>>> {
>>> ...
>>> if ((flags & GFP_KERNEL) != GFP_KERNEL)
>>>         return kmalloc_node(size, flags, node);
>>> ...
>>> }
>>>
>>> Best Regards,
>>> Kevin
>>>
>>> > to allocate memory, when request size is exceeds kmalloc limit, it 
>>> will
>>> > cause allocate memory fail.
>>> >
>>> > e.g: when ttm want to create a BO with 1TB size, it maybe fail.
>>> >
>>> > Signed-off-by: Yang Wang <KevinYang.Wang at amd.com> 
>>> <mailto:KevinYang.Wang at amd.com>
>>> > ---
>>> >   drivers/gpu/drm/ttm/ttm_tt.c | 14 +++++++++++---
>>> >   1 file changed, 11 insertions(+), 3 deletions(-)
>>> >
>>> > diff --git a/drivers/gpu/drm/ttm/ttm_tt.c 
>>> b/drivers/gpu/drm/ttm/ttm_tt.c
>>> > index 79c870a3bef8..9f2f3e576b8d 100644
>>> > --- a/drivers/gpu/drm/ttm/ttm_tt.c
>>> > +++ b/drivers/gpu/drm/ttm/ttm_tt.c
>>> > @@ -97,9 +97,12 @@ int ttm_tt_create(struct ttm_buffer_object *bo, 
>>> bool zero_alloc)
>>> >   static int ttm_tt_alloc_page_directory(struct ttm_tt *ttm)
>>> >   {
>>> >        ttm->pages = kvmalloc_array(ttm->num_pages, sizeof(void*),
>>> > -                     GFP_KERNEL | __GFP_ZERO);
>>> > + GFP_KERNEL);
>>> >        if (!ttm->pages)
>>> >                return -ENOMEM;
>>> > +
>>> > +     memset(ttm->pages, 0, ttm->num_pages * sizeof(void *));
>>> > +
>>> >        return 0;
>>> >   }
>>> >
>>> > @@ -108,10 +111,12 @@ static int 
>>> ttm_dma_tt_alloc_page_directory(struct ttm_tt *ttm)
>>> >        ttm->pages = kvmalloc_array(ttm->num_pages,
>>> > sizeof(*ttm->pages) +
>>> > sizeof(*ttm->dma_address),
>>> > - GFP_KERNEL | __GFP_ZERO);
>>> > + GFP_KERNEL);
>>> >        if (!ttm->pages)
>>> >                return -ENOMEM;
>>> >
>>> > +     memset(ttm->pages, 0, ttm->num_pages * (sizeof(*ttm->pages) 
>>> + sizeof(*ttm->dma_address)));
>>> > +
>>> >        ttm->dma_address = (void *)(ttm->pages + ttm->num_pages);
>>> >        return 0;
>>> >   }
>>> > @@ -120,9 +125,12 @@ static int 
>>> ttm_sg_tt_alloc_page_directory(struct ttm_tt *ttm)
>>> >   {
>>> >        ttm->dma_address = kvmalloc_array(ttm->num_pages,
>>> >                                         sizeof(*ttm->dma_address),
>>> > - GFP_KERNEL | __GFP_ZERO);
>>> > + GFP_KERNEL);
>>> >        if (!ttm->dma_address)
>>> >                return -ENOMEM;
>>> > +
>>> > +     memset(ttm->dma_address, 0, ttm->num_pages * 
>>> sizeof(*ttm->dma_address));
>>> > +
>>> >        return 0;
>>> >   }
>>> >
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20220420/45e69f60/attachment-0001.htm>


More information about the amd-gfx mailing list