[PATCH] drm/ttm: fix ttm tt init fail when size exceeds kmalloc limit

Christian König christian.koenig at amd.com
Wed Apr 20 12:56:50 UTC 2022


Am 20.04.22 um 14:54 schrieb Wang, Yang(Kevin):
>
> [AMD Official Use Only]
>
>
> Hi Chris,
>
> 1) Change the test case to use something larger than 1TiB.
> sure, we can increase the size of BO and make test pass,
> but if user really want to allocate 1TB GTT BO, we have no reason to 
> let it fail? right?

No, the reason is the underlying core kernel doesn't allow kvmalloc 
allocations with GFP_ZERO which are large enough to hold the array of 
allocated pages for this.

We are working on top of the core Linux kernel and should *NEVER* ever 
add workarounds like what was suggested here.

Regards,
Christian.

> the system availed memory about 2T, but it will still fail.
>
> 2) Change kvmalloc to allow GFP_ZERO allocations even in the vmalloc 
> fallback path.
>     the 5.18 kernel will add this patch to fix this issue .
>
> Best Regards,
> Kevin
> ------------------------------------------------------------------------
> *From:* Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Wednesday, April 20, 2022 8:42 PM
> *To:* Wang, Yang(Kevin) <KevinYang.Wang at amd.com>; Christian König 
> <ckoenig.leichtzumerken at gmail.com>; dri-devel at lists.freedesktop.org 
> <dri-devel at lists.freedesktop.org>; amd-gfx at lists.freedesktop.org 
> <amd-gfx at lists.freedesktop.org>
> *Subject:* Re: [PATCH] drm/ttm: fix ttm tt init fail when size exceeds 
> kmalloc limit
> Hi Kevin,
>
> yes and that is perfectly valid and expected behavior. There is 
> absolutely no need to change anything in TTM here.
>
> What we could do is:
> 1) Change the test case to use something larger than 1TiB.
> 2) Change kvmalloc to allow GFP_ZERO allocations even in the vmalloc 
> fallback path.
>
> Regards,
> Christian.
>
> Am 20.04.22 um 14:39 schrieb Wang, Yang(Kevin):
>>
>> [AMD Official Use Only]
>>
>>
>> Hi Chirs,
>>
>> yes, right, the amdgpu drive rwill use amdgpu_bo_validate_size() 
>> function to verify bo size,
>> but when driver try to allocate VRAM domain bo fail, the amdgpu 
>> driver will fall back to allocate domain = (GTT | VRAM)  bo.
>> please check following code, it will cause the 2nd time to allocate 
>> bo fail during allocate 256Mb buffer to store dma address (via 
>> kvmalloc()).
>>
>> initial_domain = (u32)(0xffffffff & args->in.domains);
>> retry:
>>         r = amdgpu_gem_object_create(adev, size, args->in.alignment,
>>                    initial_domain,
>>                    flags, ttm_bo_type_device, resv, &gobj);
>>         if (r && r != -ERESTARTSYS) {
>>                 if (flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) {
>>       flags &= ~AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;
>>       goto retry;
>>                 }
>>
>>                 if (initial_domain == AMDGPU_GEM_DOMAIN_VRAM) {
>>       initial_domain |= AMDGPU_GEM_DOMAIN_GTT;
>>       goto retry;
>>                 }
>> DRM_DEBUG("Failed to allocate GEM object (%llu, %d, %llu, %d)\n",
>>               size, initial_domain, args->in.alignment, r);
>>         }
>>
>> Best Regards,
>> Kevin
>>
>> ------------------------------------------------------------------------
>> *From:* Christian König <ckoenig.leichtzumerken at gmail.com> 
>> <mailto:ckoenig.leichtzumerken at gmail.com>
>> *Sent:* Wednesday, April 20, 2022 7:55 PM
>> *To:* Wang, Yang(Kevin) <KevinYang.Wang at amd.com> 
>> <mailto:KevinYang.Wang at amd.com>; Koenig, Christian 
>> <Christian.Koenig at amd.com> <mailto:Christian.Koenig at amd.com>; 
>> dri-devel at lists.freedesktop.org 
>> <mailto:dri-devel at lists.freedesktop.org> 
>> <dri-devel at lists.freedesktop.org> 
>> <mailto:dri-devel at lists.freedesktop.org>; 
>> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org> 
>> <amd-gfx at lists.freedesktop.org> <mailto:amd-gfx at lists.freedesktop.org>
>> *Subject:* Re: [PATCH] drm/ttm: fix ttm tt init fail when size 
>> exceeds kmalloc limit
>> Hi Kevin,
>>
>> no, the test case should already fail in amdgpu_bo_validate_size().
>>
>> If we have a system with 2TiB of memory where the test case could 
>> succeed then we should increase the requested size to something larger.
>>
>> And if the underlying core Linux kernel functions don't allow 
>> allocations as large as the requested one we should *NEVER* ever add 
>> workarounds like this.
>>
>> It is perfectly expected that this test case is not able to fulfill 
>> the desired allocation. That it fails during kvmalloc is unfortunate, 
>> but not a show stopper.
>>
>> Regards,
>> Christian.
>>
>>
>> Am 20.04.22 um 13:32 schrieb Wang, Yang(Kevin):
>>>
>>> [AMD Official Use Only]
>>>
>>>
>>> Hi Chris,
>>>
>>> you misunderstood background about this case.
>>>
>>> although we expect this test case to fail, it should fail at the 
>>> location where the Bo actual memory is actually allocated. now the 
>>> code logic will cause the failure to allocate memory to store DMA 
>>> address.
>>>
>>> e.g: the case is failed in 2TB system ram machine, it should be 
>>> allocated successful, but it is failed.
>>>
>>> allocate 1TB BO, the ttm should allocate 1TB/4k * 8 buffer to store 
>>> allocate result (page address), this should not be failed usually.
>>>
>>> There is a similar fix in upstream kernel 5.18, before this fix 
>>> entered the TTM code, this problem existed in TTM.
>>>
>>> kernel/git/torvalds/linux.git - Linux kernel source tree 
>>> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fcommit%2F%3Fh%3Dv5.18-rc3%26id%3Da421ef303008b0ceee2cfc625c3246fa7654b0ca&data=05%7C01%7CKevinYang.Wang%40amd.com%7C2e9fd86a27d64a39432508da22c4b1f1%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637860525454702938%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=CiP9x3grwQ2aXFZPjpsAtwLCpBVeJufbGngy%2BtXLFbM%3D&reserved=0>
>>> mm: allow !GFP_KERNEL allocations for kvmalloc
>>>
>>> Best Regards,
>>> Kevin
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Koenig, Christian <Christian.Koenig at amd.com> 
>>> <mailto:Christian.Koenig at amd.com>
>>> *Sent:* Wednesday, April 20, 2022 6:53 PM
>>> *To:* Wang, Yang(Kevin) <KevinYang.Wang at amd.com> 
>>> <mailto:KevinYang.Wang at amd.com>; dri-devel at lists.freedesktop.org 
>>> <mailto:dri-devel at lists.freedesktop.org> 
>>> <dri-devel at lists.freedesktop.org> 
>>> <mailto:dri-devel at lists.freedesktop.org>; 
>>> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org> 
>>> <amd-gfx at lists.freedesktop.org> <mailto:amd-gfx at lists.freedesktop.org>
>>> *Subject:* Re: [PATCH] drm/ttm: fix ttm tt init fail when size 
>>> exceeds kmalloc limit
>>> Am 20.04.22 um 11:07 schrieb Wang, Yang(Kevin):
>>>>
>>>> [AMD Official Use Only]
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Koenig, Christian <Christian.Koenig at amd.com> 
>>>> <mailto:Christian.Koenig at amd.com>
>>>> *Sent:* Wednesday, April 20, 2022 5:00 PM
>>>> *To:* Wang, Yang(Kevin) <KevinYang.Wang at amd.com> 
>>>> <mailto:KevinYang.Wang at amd.com>; dri-devel at lists.freedesktop.org 
>>>> <mailto:dri-devel at lists.freedesktop.org> 
>>>> <dri-devel at lists.freedesktop.org> 
>>>> <mailto:dri-devel at lists.freedesktop.org>; 
>>>> amd-gfx at lists.freedesktop.org 
>>>> <mailto:amd-gfx at lists.freedesktop.org> 
>>>> <amd-gfx at lists.freedesktop.org> <mailto:amd-gfx at lists.freedesktop.org>
>>>> *Subject:* Re: [PATCH] drm/ttm: fix ttm tt init fail when size 
>>>> exceeds kmalloc limit
>>>> Am 20.04.22 um 10:56 schrieb Yang Wang:
>>>> > if the __GFP_ZERO is set, the kvmalloc() can't fallback to use 
>>>> vmalloc()
>>>>
>>>> Hui what? Why should kvmalloc() not be able to fallback to vmalloc()
>>>> when __GFP_ZERO is set?
>>>>
>>>> And even that is really the case then that sounds like a bug in 
>>>> kvmalloc().
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> [kevin]:
>>>> it is really test case from libdrm amdgpu test, which try to 
>>>> allocate a big BO which will cause ttm tt init fail.
>>>
>>>
>>> LOL! Guys, this test case is intended to fail!
>>> *
>>> *The test consists of allocating a buffer so ridiculous large that 
>>> it should never succeed and be rejected by the kernel driver.
>>>
>>> This patch here is a really clear NAK.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> it may be a kvmalloc() bug, and this patch can as a workaround 
>>>> in ttm before this issue fix.
>>>>
>>>> void *kvmalloc_node(size_t size, gfp_t flags, int node)
>>>> {
>>>> ...
>>>>       if ((flags & GFP_KERNEL) != GFP_KERNEL)
>>>>               return kmalloc_node(size, flags, node);
>>>> ...
>>>> }
>>>>
>>>> Best Regards,
>>>> Kevin
>>>>
>>>> > to allocate memory, when request size is exceeds kmalloc limit, 
>>>> it will
>>>> > cause allocate memory fail.
>>>> >
>>>> > e.g: when ttm want to create a BO with 1TB size, it maybe fail.
>>>> >
>>>> > Signed-off-by: Yang Wang <KevinYang.Wang at amd.com> 
>>>> <mailto:KevinYang.Wang at amd.com>
>>>> > ---
>>>> >   drivers/gpu/drm/ttm/ttm_tt.c | 14 +++++++++++---
>>>> >   1 file changed, 11 insertions(+), 3 deletions(-)
>>>> >
>>>> > diff --git a/drivers/gpu/drm/ttm/ttm_tt.c 
>>>> b/drivers/gpu/drm/ttm/ttm_tt.c
>>>> > index 79c870a3bef8..9f2f3e576b8d 100644
>>>> > --- a/drivers/gpu/drm/ttm/ttm_tt.c
>>>> > +++ b/drivers/gpu/drm/ttm/ttm_tt.c
>>>> > @@ -97,9 +97,12 @@ int ttm_tt_create(struct ttm_buffer_object 
>>>> *bo, bool zero_alloc)
>>>> >   static int ttm_tt_alloc_page_directory(struct ttm_tt *ttm)
>>>> >   {
>>>> >        ttm->pages = kvmalloc_array(ttm->num_pages, sizeof(void*),
>>>> > -                     GFP_KERNEL | __GFP_ZERO);
>>>> > + GFP_KERNEL);
>>>> >        if (!ttm->pages)
>>>> >                return -ENOMEM;
>>>> > +
>>>> > +     memset(ttm->pages, 0, ttm->num_pages * sizeof(void *));
>>>> > +
>>>> >        return 0;
>>>> >   }
>>>> >
>>>> > @@ -108,10 +111,12 @@ static int 
>>>> ttm_dma_tt_alloc_page_directory(struct ttm_tt *ttm)
>>>> >        ttm->pages = kvmalloc_array(ttm->num_pages,
>>>> >                                    sizeof(*ttm->pages) +
>>>> >                                    sizeof(*ttm->dma_address),
>>>> > - GFP_KERNEL | __GFP_ZERO);
>>>> > + GFP_KERNEL);
>>>> >        if (!ttm->pages)
>>>> >                return -ENOMEM;
>>>> >
>>>> > +     memset(ttm->pages, 0, ttm->num_pages * (sizeof(*ttm->pages) 
>>>> + sizeof(*ttm->dma_address)));
>>>> > +
>>>> >        ttm->dma_address = (void *)(ttm->pages + ttm->num_pages);
>>>> >        return 0;
>>>> >   }
>>>> > @@ -120,9 +125,12 @@ static int 
>>>> ttm_sg_tt_alloc_page_directory(struct ttm_tt *ttm)
>>>> >   {
>>>> >        ttm->dma_address = kvmalloc_array(ttm->num_pages,
>>>> >                                         sizeof(*ttm->dma_address),
>>>> > - GFP_KERNEL | __GFP_ZERO);
>>>> > + GFP_KERNEL);
>>>> >        if (!ttm->dma_address)
>>>> >                return -ENOMEM;
>>>> > +
>>>> > +     memset(ttm->dma_address, 0, ttm->num_pages * 
>>>> sizeof(*ttm->dma_address));
>>>> > +
>>>> >        return 0;
>>>> >   }
>>>> >
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20220420/5b003fef/attachment-0001.htm>


More information about the amd-gfx mailing list