Failed to find memory space for buffer eviction

Christian König christian.koenig at amd.com
Tue Jul 14 08:28:49 UTC 2020


Hi Felix,

yes I already stumbled over this as well quite recently.

See the following patch which I pushed to drm-misc-next just yesterday:

commit e04be2310b5eac683ec03b096c0e22c4c2e23593
Author: Christian König <christian.koenig at amd.com>
Date:   Mon Jul 6 17:32:55 2020 +0200

     drm/ttm: further cleanup ttm_mem_reg handling

     Stop touching the backend private pointer alltogether and
     make sure we never put the same mem twice by.

     Signed-off-by: Christian König <christian.koenig at amd.com>
     Reviewed-by: Madhav Chauhan <madhav.chauhan at amd.com>
     Link: https://patchwork.freedesktop.org/patch/375613/


But this shouldn't have been problematic since we used a dummy value for 
mem->mm_node in this case.

What could be problematic and result in an overrun is that TTM was buggy 
and called put_node twice for the same memory.
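
To illustrate why a double put matters for a counter-based manager, here is a 
toy model in plain C. It is only a sketch: every name in it (toy_mgr, toy_mem, 
toy_get_node, toy_put_node_unguarded, toy_put_node_guarded, toy_double_put) is 
hypothetical and none of it is the actual TTM or amdgpu code. It assumes the 
manager tracks free pages in a single counter and that mm_node serves as the 
"this region holds an allocation" marker.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins with hypothetical names; NOT the real TTM/amdgpu structs. */
struct toy_mgr {
    int64_t size;       /* total pages managed */
    int64_t available;  /* pages currently free (a plain integer in this sketch) */
};

struct toy_mem {
    int64_t num_pages;
    void *mm_node;      /* non-NULL while the region holds an allocation */
};

static int toy_dummy_node;  /* dummy marker value, assumed for illustration */

/* Reserve pages: debit the counter and mark the region as allocated. */
static int toy_get_node(struct toy_mgr *mgr, struct toy_mem *mem)
{
    if (mgr->available < mem->num_pages)
        return -1;
    mgr->available -= mem->num_pages;
    mem->mm_node = &toy_dummy_node;
    return 0;
}

/* Unguarded release: credits the counter on every call. */
static void toy_put_node_unguarded(struct toy_mgr *mgr, struct toy_mem *mem)
{
    mgr->available += mem->num_pages;
}

/* Guarded release: clears the marker so a second put is a no-op. */
static void toy_put_node_guarded(struct toy_mgr *mgr, struct toy_mem *mem)
{
    if (!mem->mm_node)
        return;
    mgr->available += mem->num_pages;
    mem->mm_node = NULL;
}

/* Allocate once, put twice, and return the resulting counter value. */
int64_t toy_double_put(int guarded)
{
    struct toy_mgr mgr = { .size = 1024, .available = 1024 };
    struct toy_mem mem = { .num_pages = 16, .mm_node = NULL };
    int i;

    toy_get_node(&mgr, &mem);
    for (i = 0; i < 2; i++) {
        if (guarded)
            toy_put_node_guarded(&mgr, &mem);
        else
            toy_put_node_unguarded(&mgr, &mem);
    }
    return mgr.available;
}
```

With the unguarded put the counter ends up above the manager size, so the 
accounting no longer matches reality; with the guarded put it returns to the 
initial value.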

So I've seen that the code needs fixing as well, but I'm not 100% sure 
how you ran into your problem.

Regards,
Christian.

On 14.07.20 at 02:44, Felix Kuehling wrote:
> I'm running into this problem with the KFD EvictionTest. The log snippet
> below looks like it ran out of GTT space for the eviction of a 64MB
> buffer. But then it dumps the used and free space and shows plenty of
> free space.
>
> As I understand it, the per-page breakdown of used and free space shown
> by TTM is the GART space. So it's not very meaningful.
>
> What matters more is the GTT space managed by amdgpu_gtt_mgr.c. And
> that's where the problem is. It keeps track of available GTT space with
> an atomic counter in amdgpu_gtt_mgr.available. It gets decremented in
> amdgpu_gtt_mgr_new and incremented in amdgpu_gtt_mgr_del. The trouble
> is that TTM doesn't call the latter for ttm_mem_regs that don't have an
> mm_node:
>
>> void ttm_bo_mem_put(struct ttm_buffer_object *bo, struct ttm_mem_reg *mem)
>> {
>>          struct ttm_mem_type_manager *man = &bo->bdev->man[mem->mem_type];
>>
>>          if (mem->mm_node)
>>                  (*man->func->put_node)(man, mem);
>> }
> GTT BOs that don't have GART space allocated don't have an mm_node. So
> the amdgpu_gtt_mgr.available counter doesn't get incremented when an
> unmapped GTT BO is freed, and the manager eventually runs out of space.
>
> Now I know what the problem is, but I don't know how to fix it. Maybe a
> dummy-mm_node for unmapped GTT BOs, to trick TTM into calling our
> put_node callback? Or a change in TTM to call put_node unconditionally?
>
> Regards,
>    Felix
>
>
> [  360.082552] [TTM] Failed to find memory space for buffer 0x00000000264c823c eviction
> [  360.090331] [TTM]  No space for 00000000264c823c (16384 pages, 65536K, 64M)
> [  360.090334] [TTM]    placement[0]=0x00010002 (1)
> [  360.090336] [TTM]      has_type: 1
> [  360.090337] [TTM]      use_type: 1
> [  360.090339] [TTM]      flags: 0x0000000A
> [  360.090341] [TTM]      gpu_offset: 0xFF00000000
> [  360.090342] [TTM]      size: 1048576
> [  360.090344] [TTM]      available_caching: 0x00070000
> [  360.090346] [TTM]      default_caching: 0x00010000
> [  360.090349] [TTM]  0x0000000000000400-0x0000000000000402: 2: used
> [  360.090352] [TTM]  0x0000000000000402-0x0000000000000404: 2: used
> [  360.090354] [TTM]  0x0000000000000404-0x0000000000000406: 2: used
> [  360.090355] [TTM]  0x0000000000000406-0x0000000000000408: 2: used
> [  360.090357] [TTM]  0x0000000000000408-0x000000000000040a: 2: used
> [  360.090359] [TTM]  0x000000000000040a-0x000000000000040c: 2: used
> [  360.090361] [TTM]  0x000000000000040c-0x000000000000040e: 2: used
> [  360.090363] [TTM]  0x000000000000040e-0x0000000000000410: 2: used
> [  360.090365] [TTM]  0x0000000000000410-0x0000000000000412: 2: used
> [  360.090367] [TTM]  0x0000000000000412-0x0000000000000414: 2: used
> [  360.090368] [TTM]  0x0000000000000414-0x0000000000000415: 1: used
> [  360.090370] [TTM]  0x0000000000000415-0x0000000000000515: 256: used
> [  360.090372] [TTM]  0x0000000000000515-0x0000000000000516: 1: used
> [  360.090374] [TTM]  0x0000000000000516-0x0000000000000517: 1: used
> [  360.090376] [TTM]  0x0000000000000517-0x0000000000000518: 1: used
> [  360.090378] [TTM]  0x0000000000000518-0x0000000000000519: 1: used
> [  360.090379] [TTM]  0x0000000000000519-0x000000000000051a: 1: used
> [  360.090381] [TTM]  0x000000000000051a-0x000000000000051b: 1: used
> [  360.090383] [TTM]  0x000000000000051b-0x000000000000051c: 1: used
> [  360.090385] [TTM]  0x000000000000051c-0x000000000000051d: 1: used
> [  360.090387] [TTM]  0x000000000000051d-0x000000000000051f: 2: used
> [  360.090389] [TTM]  0x000000000000051f-0x0000000000000521: 2: used
> [  360.090391] [TTM]  0x0000000000000521-0x0000000000000522: 1: used
> [  360.090392] [TTM]  0x0000000000000522-0x0000000000000523: 1: used
> [  360.090394] [TTM]  0x0000000000000523-0x0000000000000524: 1: used
> [  360.090396] [TTM]  0x0000000000000524-0x0000000000000525: 1: used
> [  360.090398] [TTM]  0x0000000000000525-0x0000000000000625: 256: used
> [  360.090400] [TTM]  0x0000000000000625-0x0000000000000725: 256: used
> [  360.090402] [TTM]  0x0000000000000725-0x0000000000000727: 2: used
> [  360.090404] [TTM]  0x0000000000000727-0x00000000000007c0: 153: used
> [  360.090406] [TTM]  0x00000000000007c0-0x0000000000000b8a: 970: used
> [  360.090407] [TTM]  0x0000000000000b8a-0x0000000000000b8b: 1: used
> [  360.090409] [TTM]  0x0000000000000b8b-0x0000000000000bcb: 64: used
> [  360.090411] [TTM]  0x0000000000000bcb-0x0000000000000bcd: 2: used
> [  360.090413] [TTM]  0x0000000000000bcd-0x0000000000040000: 259123: free
> [  360.090415] [TTM]  total: 261120, used 1997 free 259123
> [  360.090417] [TTM]  man size:1048576 pages, gtt available:14371 pages, usage:4039MB
>
>


