[PATCH] drm/ttm: remove race and optimise evicting destroyed buffers.
Thomas Hellstrom
thomas at shipmail.org
Wed Sep 29 03:28:23 PDT 2010
On 09/29/2010 05:45 AM, Dave Airlie wrote:
> From: Dave Airlie <airlied at redhat.com>
>
> A race condition was identified that led to a leak of BOs: when a bo
> that was already on the delayed-destroy list was selected for eviction,
> the two code paths could interleave as follows:
>
> thread 1 (evict)                        thread 2 (dd)
>
>                                         bo added to delayed destroy list
> take lru lock
> ttm_mem_evict_first called
>  - reserve locked
>  - remove from lru
>  - drop lru lock
>                                         take lru lock
>                                         try del from lru - get put_count == 0
>                                         try to reserve locked
>                                          - drop lru lock to wait unreserved
> call ttm_bo_evict
> unreserve buffer
>  - take lru lock
>  - add back to lru
>  - unreserve
>  - drop lru lock
>                                         retake lru lock once unreserved
>                                         unbind ttm
>                                         drop put_count references
>                                          (which is 0) despite the bo being
>                                          back on the lru/swap lru lists
>                                         => leak, the list references are
>                                            never dropped
>
> The obvious quick fix is to take the bo off the ddestroy list if we
> are going to evict it. However, if we are going to evict a buffer that
> is about to be destroyed anyway, we should just destroy it instead;
> that is likely a lighter-weight operation than evict + delayed destroy.
>
> This patch checks whether a bo that we are about to evict is actually
> on the destroy list and, if it is, unreserves it (without adding it
> back to the lru) and cleans up its references. It should then get
> destroyed via the normal refcounting means.
>
>
Dave, this will not work, since we can't guarantee that the buffer will
be destroyed immediately. There may be other holders of a list_kref, and
if the destruction doesn't happen immediately, we don't get immediate
eviction.
If we take a command submission lock and decide to evict all buffers to
clean up fragmentation, we need to be able to guarantee that all buffers
are evicted synchronously.
Having said that, I think it should be possible to implement this if,
in addition to the above code, we free the mm_node immediately and don't
rely on the buffer destruction to do that.
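Something like the following, perhaps (an untested sketch only; it
assumes the drm_mm is still protected by the lru spinlock, as elsewhere
in ttm_bo.c, and that we still hold the reservation so bo->mem can't
change under us):

        spin_lock(&glob->lru_lock);
        if (bo->mem.mm_node) {
                /*
                 * Hand the space back to the manager right away,
                 * instead of waiting for the final bo destruction,
                 * so that the eviction frees memory synchronously.
                 */
                drm_mm_put_block(bo->mem.mm_node);
                bo->mem.mm_node = NULL;
        }
        spin_unlock(&glob->lru_lock);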
But that's an optimization rather than a bug fix. The real bug sits in
ttm_bo_cleanup_refs(), in the fact that we remove the buffers from the
lru lists before we reserve.
Let me try to come up with a fix for that. I think there needs to be a
second wait for bo idle as well, since, in theory, someone might have
started an eviction using an accelerated path.
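Roughly the shape I have in mind for ttm_bo_cleanup_refs() (again only
an untested sketch, not a patch; the retry and error paths would need
more thought):

        spin_lock(&glob->lru_lock);
        ret = ttm_bo_reserve_locked(bo, false, !remove_all, false, 0);
        if (unlikely(ret != 0)) {
                /* Reservation is held elsewhere; don't touch the
                 * lru lists, just back off and retry later. */
                spin_unlock(&glob->lru_lock);
                return ret;
        }

        /* Only dequeue from the lru lists once the reservation has
         * succeeded, so that a concurrent evictor can't re-add the
         * bo behind our back and make put_count stale. */
        put_count = ttm_bo_del_from_lru(bo);
        list_del_init(&bo->ddestroy);
        ++put_count;
        spin_unlock(&glob->lru_lock);

        /* A second wait for bo idle would go here, in case someone
         * started an accelerated eviction while we slept on the
         * reservation. */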
/Thomas
> proposed fix for:
> https://bugzilla.redhat.com/show_bug.cgi?id=615505
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=591061
>
> Signed-off-by: Dave Airlie <airlied at redhat.com>
> ---
> drivers/gpu/drm/ttm/ttm_bo.c | 18 +++++++++++++++---
> 1 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index cb4cf7e..60689b1 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -689,7 +689,7 @@ static int ttm_mem_evict_first(struct ttm_bo_device *bdev,
>          struct ttm_mem_type_manager *man = &bdev->man[mem_type];
>          struct ttm_buffer_object *bo;
>          int ret, put_count = 0;
> -
> +        bool destroy = false;
>  retry:
>          spin_lock(&glob->lru_lock);
>          if (list_empty(&man->lru)) {
> @@ -719,6 +719,13 @@ retry:
>          }
>
>          put_count = ttm_bo_del_from_lru(bo);
> +
> +        /* is the buffer currently on the delayed destroy list? */
> +        if (!list_empty(&bo->ddestroy)) {
> +                list_del_init(&bo->ddestroy);
> +                destroy = true;
> +                put_count++;
> +        }
>          spin_unlock(&glob->lru_lock);
>
>          BUG_ON(ret != 0);
> @@ -726,8 +733,13 @@ retry:
>          while (put_count--)
>                  kref_put(&bo->list_kref, ttm_bo_ref_bug);
>
> -        ret = ttm_bo_evict(bo, interruptible, no_wait_reserve, no_wait_gpu);
> -        ttm_bo_unreserve(bo);
> +        if (destroy) {
> +                atomic_set(&bo->reserved, 0);
> +                ret = ttm_bo_cleanup_refs(bo, !no_wait_gpu);
> +        } else {
> +                ret = ttm_bo_evict(bo, interruptible, no_wait_reserve, no_wait_gpu);
> +                ttm_bo_unreserve(bo);
> +        }
>
>          kref_put(&bo->list_kref, ttm_bo_release_list);
>          return ret;
>