[infinite loop] need some info about TTM's buffer eviction mechanism

Christian König deathsimple at vodafone.de
Tue Mar 14 16:40:26 UTC 2017


> And then try again (until ?).
Until the LRU is empty.

You have one LRU per domain, so when a buffer is evicted from VRAM it
is moved to the GTT domain and also removed from the VRAM domain's LRU.

When no other task is trying to do a CS the LRU will sooner or later 
become empty.

One possible explanation for what happens here is that another
process/thread is moving buffers back in while the first process is
trying to evict them.

Regards,
Christian.

On 14.03.2017 at 17:31, Julien Isorce wrote:
> Hello,
>
> While debugging a softlock that happens on an ioctl(RADEON_CS), I 
> found that it keeps looping indefinitely in the following loop:
> https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L819
>
> It would be great if someone could explain the logic behind this
> loop. My understanding is that it tries to get a free node for the
> current buffer object by calling "ttm_bo_man_get_node". If
> it fails with mem->mm_node as NULL (internally -ENOSPC) then it tries 
> to evict another buffer from the LRU by calling "ttm_mem_evict_first". 
> And then try again (until ?).
>
> For some reason, after some point while running an app that uploads
> a lot of images through GL, these two functions keep returning 0 with
> mem->mm_node as NULL, so the "while (true)" keeps looping indefinitely.
> As a result, the process is stuck in that ioctl forever.
>
> A nasty workaround is to break the loop after a threshold for the 
> number of iterations. It looks like it very rarely goes over 200. So
> breaking after more than 200 iterations and returning -ENOMEM lets
> the application regain control instead of being stuck. This is quite
> helpful for the debugging phase but definitely not a proper fix.
>
> A colleague found that replacing ttm_bo_unreserve with
> __ttm_bo_unreserve here
> https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L751
> fixes this softlock, because the latter does not re-add the evicted
> buffer to the LRU.
> But we are unsure whether this is a proper fix or just a workaround,
> given that this line has existed since the first TTM commit in 2009.
> Any comments?
>
> Also it looks like there is a recursion from:
>
> radeon_cs_ioctl
> radeon_cs_parser_relocs
> radeon_bo_list_validate
> ttm_bo_validate
> ttm_bo_move_buffer
> ttm_bo_mem_space  @
> ttm_bo_mem_force_space
> ttm_mem_evict_first
> ttm_bo_evict
> ttm_bo_mem_space @
> ttm_mem_evict_first
> ...
>
> It looks like it is meant to work like this, but this makes it
> complicated to follow, so any input would be much appreciated,
> especially about the
> eviction mechanism + bo->evicted flag and how TTM manages the LRU for 
> corner cases like when the VRAM is full.
>
> I tried kernel 4.4, 4.8 and git HEAD from last week.
>
> Thx
> Julien

