[infinite loop] need some info about TTM's buffer eviction mechanism

Julien Isorce julien.isorce at gmail.com
Tue Mar 14 16:31:34 UTC 2017


Hello,

While debugging a softlock that happens on an ioctl(RADEON_CS), I found
that it keeps looping indefinitely in the following loop:
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L819

It would be great if someone could explain the logic behind this loop.
My understanding is that it tries to get a free node in which to place the
current buffer object by calling "ttm_bo_man_get_node". If that fails,
leaving mem->mm_node as NULL (internally -ENOSPC), it tries to evict another
buffer from the LRU by calling "ttm_mem_evict_first", and then tries again
(until when?).
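
To make the question concrete, here is a minimal userspace sketch of the
loop as I understand it. It is not the TTM code: every model_* name, the
flat array used as an LRU and the -1 error value are assumptions made up
for illustration. In this model every eviction really frees space and the
LRU eventually runs empty, so the loop always terminates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in TTM. */
#define MODEL_MAX_BOS 8

struct model_pool {
	size_t capacity;           /* total pages in the pool */
	size_t used;               /* pages currently allocated */
	size_t lru[MODEL_MAX_BOS]; /* sizes of evictable buffers, oldest first */
	size_t lru_len;
};

/*
 * Models the "get node" step as described above: it returns 0 even when
 * nothing fits, and reports through *got whether a node was found.
 */
static int model_get_node(struct model_pool *p, size_t pages, int *got)
{
	*got = (p->capacity - p->used >= pages);
	if (*got)
		p->used += pages;
	return 0;
}

/* Evicts the oldest buffer; returns -1 (a stand-in error) if the LRU is empty. */
static int model_evict_first(struct model_pool *p)
{
	size_t i;

	if (p->lru_len == 0)
		return -1;
	assert(p->used >= p->lru[0]);
	p->used -= p->lru[0];
	for (i = 1; i < p->lru_len; i++)
		p->lru[i - 1] = p->lru[i];
	p->lru_len--;
	return 0;
}

/* Returns the number of evictions needed, or a negative value on failure. */
static int model_force_space(struct model_pool *p, size_t pages)
{
	int evictions = 0;

	for (;;) {
		int got;
		int ret = model_get_node(p, pages, &got);

		if (ret)
			return ret;
		if (got)
			return evictions;
		ret = model_evict_first(p);
		if (ret)
			return ret;
		evictions++;
	}
}
```

The only exits are "a node was found" or "an error was returned", so a loop
of this shape can only spin forever if eviction keeps reporting success
without the free space ever growing.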

For some reason, at some point while running an app that uploads a lot of
images through GL, these 2 functions keep returning 0 with mem->mm_node
still NULL, so the "while (true)" loops indefinitely. As a result, the
process is stuck in that ioctl forever.

A nasty workaround is to break out of the loop once the number of
iterations exceeds a threshold. It looks like it very rarely goes over 200,
so breaking after more than 200 iterations and returning -ENOMEM lets the
application regain control instead of being stuck. This is quite helpful
for the debugging phase but definitely not a proper fix.
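
For reference, the workaround has roughly the following shape. This is a
userspace sketch and not the patch we tested: the model_* names and the
freed_per_evict field are hypothetical, and setting freed_per_evict to 0 is
just a way to imitate the reported case where eviction returns 0 but no
space becomes available.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical cap, taken from the "rarely goes over 200" observation. */
#define MODEL_MAX_EVICT_TRIES 200

struct model_state {
	int free_pages;      /* pages currently free */
	int freed_per_evict; /* pages each eviction frees; 0 imitates the bug */
};

/* Always "succeeds", like the evictions seen in the stuck ioctl. */
static int model_evict(struct model_state *s)
{
	s->free_pages += s->freed_per_evict;
	return 0;
}

/* Same retry loop, but gives up with -ENOMEM after more than 200 tries. */
static int model_force_space_bounded(struct model_state *s, int pages)
{
	int tries;

	assert(pages > 0);
	for (tries = 0; tries <= MODEL_MAX_EVICT_TRIES; tries++) {
		if (s->free_pages >= pages) {
			s->free_pages -= pages;
			return 0;
		}
		model_evict(s);
	}
	return -ENOMEM;
}
```

The cap only turns the hang into an error for the caller; it does not
explain why eviction stops making progress.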

A colleague found that replacing ttm_bo_unreserve with __ttm_bo_unreserve here
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L751
fixes this softlock, because the latter does not re-add the evicted buffer
to the LRU.
But we are unsure whether this is a proper fix or just a workaround,
given that this line has existed since the first TTM commit in 2009. Any
comments?

Also, it looks like there is a recursion (the repeated call is marked
with @):

radeon_cs_ioctl
radeon_cs_parser_relocs
radeon_bo_list_validate
ttm_bo_validate
ttm_bo_move_buffer
ttm_bo_mem_space  @
ttm_bo_mem_force_space
ttm_mem_evict_first
ttm_bo_evict
ttm_bo_mem_space @
ttm_mem_evict_first
...

It looks like it is meant to work this way, but this makes it complicated
to follow. So any input would be much appreciated, especially about the
eviction mechanism, the bo->evicted flag, and how TTM manages the LRU in
corner cases such as when the VRAM is full.

I tried kernel 4.4, 4.8 and git HEAD from last week.

Thx
Julien