[PATCH] drm/ttm: fix ttm tt init fail when size exceeds kmalloc limit

Lazar, Lijo lijo.lazar at amd.com
Wed Apr 20 13:23:15 UTC 2022



On 4/20/2022 6:26 PM, Christian König wrote:
> Am 20.04.22 um 14:54 schrieb Wang, Yang(Kevin):
>>
>> [AMD Official Use Only]
>>
>>
>> Hi Chris,
>>
>> 1) Change the test case to use something larger than 1TiB.
>> sure, we can increase the size of BO and make test pass,
>> but if user really want to allocate 1TB GTT BO, we have no reason to 
>> let it fail? right?
> 
> No, the reason is the underlying core kernel doesn't allow kvmalloc 
> allocations with GFP_ZERO which are large enough to hold the array of 
> allocated pages for this.
> 
> We are working on top of the core Linux kernel and should *NEVER* ever 
> add workarounds like what was suggested here. >

AFAIU, for the purpose of ttm use, fallback to vmalloc is fine.

  * Please note that any use of gfp flags outside of GFP_KERNEL is 
careful to not
  * fall back to vmalloc.
  *

Actually the current implementation documents the behavior, but it is 
deep inside the implementation to be noticeable - at least not obvious 
while using kvmalloc_array.

Thanks,
Lijo

> Regards,
> Christian.
> 
>> the system availed memory about 2T, but it will still fail.
>>
>> 2) Change kvmalloc to allow GFP_ZERO allocations even in the vmalloc 
>> fallback path.
>>     the 5.18 kernel will add this patch to fix this issue .
>>
>> Best Regards,
>> Kevin
>> ------------------------------------------------------------------------
>> *From:* Koenig, Christian <Christian.Koenig at amd.com>
>> *Sent:* Wednesday, April 20, 2022 8:42 PM
>> *To:* Wang, Yang(Kevin) <KevinYang.Wang at amd.com>; Christian König 
>> <ckoenig.leichtzumerken at gmail.com>; dri-devel at lists.freedesktop.org 
>> <dri-devel at lists.freedesktop.org>; amd-gfx at lists.freedesktop.org 
>> <amd-gfx at lists.freedesktop.org>
>> *Subject:* Re: [PATCH] drm/ttm: fix ttm tt init fail when size exceeds 
>> kmalloc limit
>> Hi Kevin,
>>
>> yes and that is perfectly valid and expected behavior. There is 
>> absolutely no need to change anything in TTM here.
>>
>> What we could do is:
>> 1) Change the test case to use something larger than 1TiB.
>> 2) Change kvmalloc to allow GFP_ZERO allocations even in the vmalloc 
>> fallback path.
>>
>> Regards,
>> Christian.
>>
>> Am 20.04.22 um 14:39 schrieb Wang, Yang(Kevin):
>>>
>>> [AMD Official Use Only]
>>>
>>>
>>> Hi Chirs,
>>>
>>> yes, right, the amdgpu drive rwill use amdgpu_bo_validate_size() 
>>> function to verify bo size,
>>> but when driver try to allocate VRAM domain bo fail, the amdgpu 
>>> driver will fall back to allocate domain = (GTT | VRAM)  bo.
>>> please check following code, it will cause the 2nd time to allocate 
>>> bo fail during allocate 256Mb buffer to store dma address (via 
>>> kvmalloc()).
>>>
>>> initial_domain = (u32)(0xffffffff & args->in.domains);
>>> retry:
>>>         r = amdgpu_gem_object_create(adev, size, args->in.alignment,
>>>                    initial_domain,
>>>                    flags, ttm_bo_type_device, resv, &gobj);
>>>         if (r && r != -ERESTARTSYS) {
>>>                 if (flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) {
>>>       flags &= ~AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;
>>>       goto retry;
>>>                 }
>>>
>>>                 if (initial_domain == AMDGPU_GEM_DOMAIN_VRAM) {
>>>       initial_domain |= AMDGPU_GEM_DOMAIN_GTT;
>>>       goto retry;
>>>                 }
>>> DRM_DEBUG("Failed to allocate GEM object (%llu, %d, %llu, %d)\n",
>>>               size, initial_domain, args->in.alignment, r);
>>>         }
>>>
>>> Best Regards,
>>> Kevin
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Christian König <ckoenig.leichtzumerken at gmail.com> 
>>> <mailto:ckoenig.leichtzumerken at gmail.com>
>>> *Sent:* Wednesday, April 20, 2022 7:55 PM
>>> *To:* Wang, Yang(Kevin) <KevinYang.Wang at amd.com> 
>>> <mailto:KevinYang.Wang at amd.com>; Koenig, Christian 
>>> <Christian.Koenig at amd.com> <mailto:Christian.Koenig at amd.com>; 
>>> dri-devel at lists.freedesktop.org 
>>> <mailto:dri-devel at lists.freedesktop.org> 
>>> <dri-devel at lists.freedesktop.org> 
>>> <mailto:dri-devel at lists.freedesktop.org>; 
>>> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org> 
>>> <amd-gfx at lists.freedesktop.org> <mailto:amd-gfx at lists.freedesktop.org>
>>> *Subject:* Re: [PATCH] drm/ttm: fix ttm tt init fail when size 
>>> exceeds kmalloc limit
>>> Hi Kevin,
>>>
>>> no, the test case should already fail in amdgpu_bo_validate_size().
>>>
>>> If we have a system with 2TiB of memory where the test case could 
>>> succeed then we should increase the requested size to something larger.
>>>
>>> And if the underlying core Linux kernel functions don't allow 
>>> allocations as large as the requested one we should *NEVER* ever add 
>>> workarounds like this.
>>>
>>> It is perfectly expected that this test case is not able to fulfill 
>>> the desired allocation. That it fails during kvmalloc is unfortunate, 
>>> but not a show stopper.
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>> Am 20.04.22 um 13:32 schrieb Wang, Yang(Kevin):
>>>>
>>>> [AMD Official Use Only]
>>>>
>>>>
>>>> Hi Chris,
>>>>
>>>> you misunderstood background about this case.
>>>>
>>>> although we expect this test case to fail, it should fail at the 
>>>> location where the Bo actual memory is actually allocated. now the 
>>>> code logic will cause the failure to allocate memory to store DMA 
>>>> address.
>>>>
>>>> e.g: the case is failed in 2TB system ram machine, it should be 
>>>> allocated successful, but it is failed.
>>>>
>>>> allocate 1TB BO, the ttm should allocate 1TB/4k * 8 buffer to store 
>>>> allocate result (page address), this should not be failed usually.
>>>>
>>>> There is a similar fix in upstream kernel 5.18, before this fix 
>>>> entered the TTM code, this problem existed in TTM.
>>>>
>>>> kernel/git/torvalds/linux.git - Linux kernel source tree 
>>>> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fcommit%2F%3Fh%3Dv5.18-rc3%26id%3Da421ef303008b0ceee2cfc625c3246fa7654b0ca&data=05%7C01%7Clijo.lazar%40amd.com%7C908fad4b756248e06e5e08da22cd4463%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637860562263637656%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=34hU%2FcxcRgBiiZ2jdNxTVmU7t4JF7TR27hx1b119U9g%3D&reserved=0>
>>>> mm: allow !GFP_KERNEL allocations for kvmalloc
>>>>
>>>> Best Regards,
>>>> Kevin
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Koenig, Christian <Christian.Koenig at amd.com> 
>>>> <mailto:Christian.Koenig at amd.com>
>>>> *Sent:* Wednesday, April 20, 2022 6:53 PM
>>>> *To:* Wang, Yang(Kevin) <KevinYang.Wang at amd.com> 
>>>> <mailto:KevinYang.Wang at amd.com>; dri-devel at lists.freedesktop.org 
>>>> <mailto:dri-devel at lists.freedesktop.org> 
>>>> <dri-devel at lists.freedesktop.org> 
>>>> <mailto:dri-devel at lists.freedesktop.org>; 
>>>> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org> 
>>>> <amd-gfx at lists.freedesktop.org> <mailto:amd-gfx at lists.freedesktop.org>
>>>> *Subject:* Re: [PATCH] drm/ttm: fix ttm tt init fail when size 
>>>> exceeds kmalloc limit
>>>> Am 20.04.22 um 11:07 schrieb Wang, Yang(Kevin):
>>>>>
>>>>> [AMD Official Use Only]
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* Koenig, Christian <Christian.Koenig at amd.com> 
>>>>> <mailto:Christian.Koenig at amd.com>
>>>>> *Sent:* Wednesday, April 20, 2022 5:00 PM
>>>>> *To:* Wang, Yang(Kevin) <KevinYang.Wang at amd.com> 
>>>>> <mailto:KevinYang.Wang at amd.com>; dri-devel at lists.freedesktop.org 
>>>>> <mailto:dri-devel at lists.freedesktop.org> 
>>>>> <dri-devel at lists.freedesktop.org> 
>>>>> <mailto:dri-devel at lists.freedesktop.org>; 
>>>>> amd-gfx at lists.freedesktop.org 
>>>>> <mailto:amd-gfx at lists.freedesktop.org> 
>>>>> <amd-gfx at lists.freedesktop.org> <mailto:amd-gfx at lists.freedesktop.org>
>>>>> *Subject:* Re: [PATCH] drm/ttm: fix ttm tt init fail when size 
>>>>> exceeds kmalloc limit
>>>>> Am 20.04.22 um 10:56 schrieb Yang Wang:
>>>>> > if the __GFP_ZERO is set, the kvmalloc() can't fallback to use 
>>>>> vmalloc()
>>>>>
>>>>> Hui what? Why should kvmalloc() not be able to fallback to vmalloc()
>>>>> when __GFP_ZERO is set?
>>>>>
>>>>> And even that is really the case then that sounds like a bug in 
>>>>> kvmalloc().
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> [kevin]:
>>>>> it is really test case from libdrm amdgpu test, which try to 
>>>>> allocate a big BO which will cause ttm tt init fail.
>>>>
>>>>
>>>> LOL! Guys, this test case is intended to fail!
>>>> *
>>>> *The test consists of allocating a buffer so ridiculous large that 
>>>> it should never succeed and be rejected by the kernel driver.
>>>>
>>>> This patch here is a really clear NAK.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> it may be a kvmalloc() bug, and this patch can as a workaround 
>>>>> in ttm before this issue fix.
>>>>>
>>>>> void *kvmalloc_node(size_t size, gfp_t flags, int node)
>>>>> {
>>>>> ...
>>>>>       if ((flags & GFP_KERNEL) != GFP_KERNEL)
>>>>>               return kmalloc_node(size, flags, node);
>>>>> ...
>>>>> }
>>>>>
>>>>> Best Regards,
>>>>> Kevin
>>>>>
>>>>> > to allocate memory, when request size is exceeds kmalloc limit, 
>>>>> it will
>>>>> > cause allocate memory fail.
>>>>> >
>>>>> > e.g: when ttm want to create a BO with 1TB size, it maybe fail.
>>>>> >
>>>>> > Signed-off-by: Yang Wang <KevinYang.Wang at amd.com> 
>>>>> <mailto:KevinYang.Wang at amd.com>
>>>>> > ---
>>>>> >   drivers/gpu/drm/ttm/ttm_tt.c | 14 +++++++++++---
>>>>> >   1 file changed, 11 insertions(+), 3 deletions(-)
>>>>> >
>>>>> > diff --git a/drivers/gpu/drm/ttm/ttm_tt.c 
>>>>> b/drivers/gpu/drm/ttm/ttm_tt.c
>>>>> > index 79c870a3bef8..9f2f3e576b8d 100644
>>>>> > --- a/drivers/gpu/drm/ttm/ttm_tt.c
>>>>> > +++ b/drivers/gpu/drm/ttm/ttm_tt.c
>>>>> > @@ -97,9 +97,12 @@ int ttm_tt_create(struct ttm_buffer_object 
>>>>> *bo, bool zero_alloc)
>>>>> >   static int ttm_tt_alloc_page_directory(struct ttm_tt *ttm)
>>>>> >   {
>>>>> >        ttm->pages = kvmalloc_array(ttm->num_pages, sizeof(void*),
>>>>> > -                     GFP_KERNEL | __GFP_ZERO);
>>>>> > + GFP_KERNEL);
>>>>> >        if (!ttm->pages)
>>>>> >                return -ENOMEM;
>>>>> > +
>>>>> > +     memset(ttm->pages, 0, ttm->num_pages * sizeof(void *));
>>>>> > +
>>>>> >        return 0;
>>>>> >   }
>>>>> >
>>>>> > @@ -108,10 +111,12 @@ static int 
>>>>> ttm_dma_tt_alloc_page_directory(struct ttm_tt *ttm)
>>>>> >        ttm->pages = kvmalloc_array(ttm->num_pages,
>>>>> >                                    sizeof(*ttm->pages) +
>>>>> >                                    sizeof(*ttm->dma_address),
>>>>> > - GFP_KERNEL | __GFP_ZERO);
>>>>> > + GFP_KERNEL);
>>>>> >        if (!ttm->pages)
>>>>> >                return -ENOMEM;
>>>>> >
>>>>> > +     memset(ttm->pages, 0, ttm->num_pages * (sizeof(*ttm->pages) 
>>>>> + sizeof(*ttm->dma_address)));
>>>>> > +
>>>>> >        ttm->dma_address = (void *)(ttm->pages + ttm->num_pages);
>>>>> >        return 0;
>>>>> >   }
>>>>> > @@ -120,9 +125,12 @@ static int 
>>>>> ttm_sg_tt_alloc_page_directory(struct ttm_tt *ttm)
>>>>> >   {
>>>>> >        ttm->dma_address = kvmalloc_array(ttm->num_pages,
>>>>> >                                         sizeof(*ttm->dma_address),
>>>>> > - GFP_KERNEL | __GFP_ZERO);
>>>>> > + GFP_KERNEL);
>>>>> >        if (!ttm->dma_address)
>>>>> >                return -ENOMEM;
>>>>> > +
>>>>> > +     memset(ttm->dma_address, 0, ttm->num_pages * 
>>>>> sizeof(*ttm->dma_address));
>>>>> > +
>>>>> >        return 0;
>>>>> >   }
>>>>> >
>>>>>
>>>>
>>>
>>
> 


More information about the amd-gfx mailing list