Reverted another change to fix buffer move hangs (was Re: [PATCH] drm/ttm: partial revert "cleanup ttm_tt_(unbind|destroy)" v2)

Christian König deathsimple at vodafone.de
Sat Aug 13 09:25:25 UTC 2016


Am 13.08.2016 um 01:22 schrieb Felix Kuehling:
> [CC Kent FYI]
>
> On 16-08-11 04:31 PM, Deucher, Alexander wrote:
>>> -----Original Message-----
>>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf
>>> Of Felix Kuehling
>>> Sent: Thursday, August 11, 2016 3:52 PM
>>> To: Michel Dänzer; Christian König
>>> Cc: amd-gfx at lists.freedesktop.org
>>> Subject: Reverted another change to fix buffer move hangs (was Re:
>>> [PATCH] drm/ttm: partial revert "cleanup ttm_tt_(unbind|destroy)" v2)
>>>
>>> We had to revert another change on the KFD branch to fix a buffer move
>>> problem: 8b6b79f43801f00ddcdc10a4d5719eba4b2e32aa (drm/amdgpu:
>>> group BOs
>>> by log2 of the size on the LRU v2
>> That makes sense.  I think you may want a different LRU scheme for KFD or at least special handling for KFD buffers.
> [FK] But I think the patch shouldn't cause hangs, regardless.
>
> I eventually found what the problem was. The "group BOs by log2 of the
> size on the LRU v2" patch exposed a latent bug related to the GART size.
> On our KFD branch, we calculate the GART size differently, and it can
> easily go above 4GB. I think on amd-staging-4.6 the GART size can also
> go above 4GB on cards with lots of VRAM.
>
> However, the offset parameter in amdgpu_gart_bind and unbind is only
> 32-bit. With the patch our test ended up using GART offsets beyond 4GB
> for the first time. Changing the offset parameter to uint64_t fixes the
> problem.

Nice catch, please provide a patch to fix this.

> Our test also demonstrates a potential flaw in the log2 grouping patch:
> When a buffer of a previously unused size is added to the LRU, it gets
> added to the front of the list, rather than the tail. So an application
> that allocates a very large buffer after a bunch of smaller buffers, is
> very likely to have that buffer evicted over and over again before any
> smaller buffers are considered for eviction. I believe, this can result
> in thrashing of large buffers.
>
> Some other observations: When the last BO of a given size is removed
> from the LRU list, the LRU tail for that size is left "floating" in the
> middle of the LRU list. So the next BO of that size that is added, will
> be added at an arbitrary position in the list. It may even end up in the
> middle of a block of pages of a different size. So a log2 grouping may
> end up being split.

Yeah, those are more or less known issues.

Keep in mind that we only added the grouping by log2 of the size to have 
a justification to push the TTM changes upstream for the coming KFD fences.

E.g. so that we are able to have this upstream before we try to push on 
the fence code.

I will take a look at fixing those issues when I have time, shouldn't be 
to complicated to set the entries to zero when they aren't used or 
adjust other entries as well when some are added.

Regards,
Christian.

>
> Regards,
>    Felix
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx




More information about the amd-gfx mailing list