Failed to find memory space for buffer eviction
Felix Kuehling
felix.kuehling at amd.com
Wed Jul 15 15:14:01 UTC 2020
Am 2020-07-15 um 5:28 a.m. schrieb Christian König:
> Am 15.07.20 um 04:49 schrieb Felix Kuehling:
>> Am 2020-07-14 um 4:28 a.m. schrieb Christian König:
>>> Hi Felix,
>>>
>>> yes I already stumbled over this as well quite recently.
>>>
>>> See the following patch which I pushed to drm-misc-next just yesterday:
>>>
>>> commit e04be2310b5eac683ec03b096c0e22c4c2e23593
>>> Author: Christian König <christian.koenig at amd.com>
>>> Date: Mon Jul 6 17:32:55 2020 +0200
>>>
>>> drm/ttm: further cleanup ttm_mem_reg handling
>>>
>>> Stop touching the backend private pointer altogether and
>>> make sure we never put the same mem twice.
>>>
>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>> Reviewed-by: Madhav Chauhan <madhav.chauhan at amd.com>
>>> Link: https://patchwork.freedesktop.org/patch/375613/
>>>
>>>
>>> But this shouldn't have been problematic since we used a dummy value
>>> for mem->mm_node in this case.
>> Hmm, yeah, I was reading the code wrong. It's possible that I was really
>> just out of GTT space. But see below.
>
> It looks like it yes.
I checked. I don't see a general GTT space leak. During the eviction
test the GTT usage spikes, but after finishing the test, GTT usage goes
back down to 7MB.
>
>>> What could be problematic and result in an overrun is that TTM was
>>> buggy and called put_node twice for the same memory.
>>>
>>> So I've seen that the code needs fixing as well, but I'm not 100% sure
>>> how you ran into your problem.
>> This is in the KFD eviction test, which deliberately overcommits VRAM in
>> order to trigger lots of evictions. It will use some GTT space while BOs
>> are evicted. But shouldn't it move them further out of GTT and into
>> SYSTEM to free up GTT space?
>
> Yes, exactly that should happen.
>
> But for some reason it couldn't find a candidate to evict and the
> 14371 pages left are just a bit too small for the buffer.
That would be a nested eviction. A VRAM to GTT eviction requires a GTT
to SYSTEM eviction to make space in GTT. Is that even possible?
Regards,
Felix
>
> Regards,
> Christian.
>
>> Your change "further cleanup ttm_mem_reg handling" removes a
>> mem->mm_node = NULL in ttm_bo_handle_move_mem in exactly the case where
>> a BO is moved from GTT to SYSTEM. I think that leads to a later put_node
>> call not happening or amdgpu_gtt_mgr_del returning before incrementing
>> mgr->available.
>>
>> I can check whether cherry-picking your two fixes helps with the
>> eviction test.
>>
>> Regards,
>> Felix
>>
>>
>>> Regards,
>>> Christian.
>>>
>>> Am 14.07.20 um 02:44 schrieb Felix Kuehling:
>>>> I'm running into this problem with the KFD EvictionTest. The log
>>>> snippet
>>>> below looks like it ran out of GTT space for the eviction of a 64MB
>>>> buffer. But then it dumps the used and free space and shows plenty of
>>>> free space.
>>>>
>>>> As I understand it, the per-page breakdown of used and free space
>>>> shown
>>>> by TTM is the GART space. So it's not very meaningful.
>>>>
>>>> What matters more is the GTT space managed by amdgpu_gtt_mgr.c. And
>>>> that's where the problem is. It keeps track of available GTT space
>>>> with
>>>> an atomic counter in amdgpu_gtt_mgr.available. It gets decremented in
>>>> amdgpu_gtt_mgr_new and incremented in amdgpu_gtt_mgr_del. The trouble
>>>> is, that TTM doesn't call the latter for ttm_mem_regs that don't
>>>> have an
>>>> mm_node:
>>>>
>>>>> void ttm_bo_mem_put(struct ttm_buffer_object *bo,
>>>>>                     struct ttm_mem_reg *mem)
>>>>> {
>>>>>         struct ttm_mem_type_manager *man =
>>>>>                 &bo->bdev->man[mem->mem_type];
>>>>>
>>>>>         if (mem->mm_node)
>>>>>                 (*man->func->put_node)(man, mem);
>>>>> }
>>>> GTT BOs that don't have GART space allocated don't have an
>>>> mm_node. So
>>>> the amdgpu_gtt_mgr.available counter doesn't get incremented when an
>>>> unmapped GTT BO is freed, and the manager eventually runs out of space.
>>>>
>>>> Now I know what the problem is, but I don't know how to fix it.
>>>> Maybe a
>>>> dummy mm_node for unmapped GTT BOs, to trick TTM into calling our
>>>> put_node callback? Or a change in TTM to call put_node
>>>> unconditionally?
>>>>
>>>> Regards,
>>>> Felix
>>>>
>>>>
>>>> [ 360.082552] [TTM] Failed to find memory space for buffer
>>>> 0x00000000264c823c eviction
>>>> [ 360.090331] [TTM] No space for 00000000264c823c (16384 pages,
>>>> 65536K, 64M)
>>>> [ 360.090334] [TTM] placement[0]=0x00010002 (1)
>>>> [ 360.090336] [TTM] has_type: 1
>>>> [ 360.090337] [TTM] use_type: 1
>>>> [ 360.090339] [TTM] flags: 0x0000000A
>>>> [ 360.090341] [TTM] gpu_offset: 0xFF00000000
>>>> [ 360.090342] [TTM] size: 1048576
>>>> [ 360.090344] [TTM] available_caching: 0x00070000
>>>> [ 360.090346] [TTM] default_caching: 0x00010000
>>>> [ 360.090349] [TTM] 0x0000000000000400-0x0000000000000402: 2: used
>>>> [ 360.090352] [TTM] 0x0000000000000402-0x0000000000000404: 2: used
>>>> [ 360.090354] [TTM] 0x0000000000000404-0x0000000000000406: 2: used
>>>> [ 360.090355] [TTM] 0x0000000000000406-0x0000000000000408: 2: used
>>>> [ 360.090357] [TTM] 0x0000000000000408-0x000000000000040a: 2: used
>>>> [ 360.090359] [TTM] 0x000000000000040a-0x000000000000040c: 2: used
>>>> [ 360.090361] [TTM] 0x000000000000040c-0x000000000000040e: 2: used
>>>> [ 360.090363] [TTM] 0x000000000000040e-0x0000000000000410: 2: used
>>>> [ 360.090365] [TTM] 0x0000000000000410-0x0000000000000412: 2: used
>>>> [ 360.090367] [TTM] 0x0000000000000412-0x0000000000000414: 2: used
>>>> [ 360.090368] [TTM] 0x0000000000000414-0x0000000000000415: 1: used
>>>> [ 360.090370] [TTM] 0x0000000000000415-0x0000000000000515: 256: used
>>>> [ 360.090372] [TTM] 0x0000000000000515-0x0000000000000516: 1: used
>>>> [ 360.090374] [TTM] 0x0000000000000516-0x0000000000000517: 1: used
>>>> [ 360.090376] [TTM] 0x0000000000000517-0x0000000000000518: 1: used
>>>> [ 360.090378] [TTM] 0x0000000000000518-0x0000000000000519: 1: used
>>>> [ 360.090379] [TTM] 0x0000000000000519-0x000000000000051a: 1: used
>>>> [ 360.090381] [TTM] 0x000000000000051a-0x000000000000051b: 1: used
>>>> [ 360.090383] [TTM] 0x000000000000051b-0x000000000000051c: 1: used
>>>> [ 360.090385] [TTM] 0x000000000000051c-0x000000000000051d: 1: used
>>>> [ 360.090387] [TTM] 0x000000000000051d-0x000000000000051f: 2: used
>>>> [ 360.090389] [TTM] 0x000000000000051f-0x0000000000000521: 2: used
>>>> [ 360.090391] [TTM] 0x0000000000000521-0x0000000000000522: 1: used
>>>> [ 360.090392] [TTM] 0x0000000000000522-0x0000000000000523: 1: used
>>>> [ 360.090394] [TTM] 0x0000000000000523-0x0000000000000524: 1: used
>>>> [ 360.090396] [TTM] 0x0000000000000524-0x0000000000000525: 1: used
>>>> [ 360.090398] [TTM] 0x0000000000000525-0x0000000000000625: 256: used
>>>> [ 360.090400] [TTM] 0x0000000000000625-0x0000000000000725: 256: used
>>>> [ 360.090402] [TTM] 0x0000000000000725-0x0000000000000727: 2: used
>>>> [ 360.090404] [TTM] 0x0000000000000727-0x00000000000007c0: 153: used
>>>> [ 360.090406] [TTM] 0x00000000000007c0-0x0000000000000b8a: 970: used
>>>> [ 360.090407] [TTM] 0x0000000000000b8a-0x0000000000000b8b: 1: used
>>>> [ 360.090409] [TTM] 0x0000000000000b8b-0x0000000000000bcb: 64: used
>>>> [ 360.090411] [TTM] 0x0000000000000bcb-0x0000000000000bcd: 2: used
>>>> [ 360.090413] [TTM] 0x0000000000000bcd-0x0000000000040000: 259123:
>>>> free
>>>> [ 360.090415] [TTM] total: 261120, used 1997 free 259123
>>>> [ 360.090417] [TTM] man size:1048576 pages, gtt available:14371
>>>> pages,
>>>> usage:4039MB
>>>>
>>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>