Reverted another change to fix buffer move hangs (was Re: [PATCH] drm/ttm: partial revert "cleanup ttm_tt_(unbind|destroy)" v2)

Felix Kuehling felix.kuehling at amd.com
Fri Aug 12 23:22:23 UTC 2016


[CC Kent FYI]

On 16-08-11 04:31 PM, Deucher, Alexander wrote:
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf
>> Of Felix Kuehling
>> Sent: Thursday, August 11, 2016 3:52 PM
>> To: Michel Dänzer; Christian König
>> Cc: amd-gfx at lists.freedesktop.org
>> Subject: Reverted another change to fix buffer move hangs (was Re:
>> [PATCH] drm/ttm: partial revert "cleanup ttm_tt_(unbind|destroy)" v2)
>>
>> We had to revert another change on the KFD branch to fix a buffer move
>> problem: 8b6b79f43801f00ddcdc10a4d5719eba4b2e32aa (drm/amdgpu: group BOs
>> by log2 of the size on the LRU v2)
> That makes sense.  I think you may want a different LRU scheme for KFD or at least special handling for KFD buffers.

[FK] But I think the patch shouldn't cause hangs, regardless.

I eventually found what the problem was. The "group BOs by log2 of the
size on the LRU v2" patch exposed a latent bug related to the GART size.
On our KFD branch, we calculate the GART size differently, and it can
easily go above 4GB. I think on amd-staging-4.6 the GART size can also
go above 4GB on cards with lots of VRAM.

However, the offset parameter in amdgpu_gart_bind and amdgpu_gart_unbind
is only 32 bits wide. With the patch, our test ended up using GART
offsets beyond 4GB for the first time, so the offsets were truncated.
Changing the offset parameter to uint64_t fixes the problem.
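
To illustrate the truncation, here is a minimal userspace sketch. It is
not the amdgpu code: the function names, the 4096-byte page size and the
example offset are assumptions made only for this illustration. It shows
how an offset just past 4GB maps to a completely different page index
once it has gone through a 32-bit parameter.

```c
#include <assert.h>
#include <stdint.h>

#define GPU_PAGE_SIZE 4096u   /* assumed page size for this sketch */

/* Hypothetical stand-in for a helper whose offset parameter is 32-bit. */
uint64_t first_page_narrow(uint32_t offset)
{
    assert(offset % GPU_PAGE_SIZE == 0);   /* offsets are page aligned */
    return offset / GPU_PAGE_SIZE;
}

/* The same computation with the parameter widened to 64 bits. */
uint64_t first_page_wide(uint64_t offset)
{
    assert(offset % GPU_PAGE_SIZE == 0);   /* offsets are page aligned */
    return offset / GPU_PAGE_SIZE;
}

/* An example offset 8 pages past the 4GB boundary. */
uint64_t example_offset(void)
{
    return (UINT64_C(4) << 30) + 8 * (uint64_t)GPU_PAGE_SIZE;
}
```

Passing example_offset() to first_page_narrow converts the value modulo
2^32 (C does this silently at the call site), so it yields page 8, while
first_page_wide yields page 1048584. In this model the narrow version
ends up addressing table entries that belong to some other mapping.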

Our test also demonstrates a potential flaw in the log2 grouping patch:
When a buffer of a previously unused size is added to the LRU, it gets
added to the front of the list, rather than the tail. So an application
that allocates a very large buffer after a bunch of smaller buffers is
very likely to have that buffer evicted over and over again before any
smaller buffers are considered for eviction. I believe this can result
in thrashing of large buffers.
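
The front-insertion effect can be reproduced with a simplified model.
This is only a sketch under assumptions, not the patch itself: the
struct layout, NUM_GROUPS and the function names are hypothetical. Each
size group keeps a pointer to its last entry, and a group that has never
been used points at the list head.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_GROUPS 16   /* hypothetical number of log2 size groups */

struct bo {
    struct bo *prev, *next;
};

struct lru {
    struct bo head;                 /* sentinel; head.next is evicted first */
    struct bo *tail[NUM_GROUPS];    /* last BO of each size group */
};

void lru_init(struct lru *l)
{
    l->head.prev = l->head.next = &l->head;
    for (int i = 0; i < NUM_GROUPS; i++)
        l->tail[i] = &l->head;      /* unused group: tail is the sentinel */
}

/* Link bo right after its group's tail and make it the new tail. */
void lru_add(struct lru *l, struct bo *bo, int group)
{
    struct bo *pos = l->tail[group];

    bo->prev = pos;
    bo->next = pos->next;
    pos->next->prev = bo;
    pos->next = bo;
    l->tail[group] = bo;
}

/* Returns 1 if a large BO added after three small ones is the first victim. */
int large_bo_is_first_victim(void)
{
    struct lru l;
    struct bo small[3], large;

    lru_init(&l);
    for (int i = 0; i < 3; i++)
        lru_add(&l, &small[i], 2);  /* group 2: small buffers */
    lru_add(&l, &large, 10);        /* group 10: first BO of this size */

    return l.head.next == &large;
}
```

In this model large_bo_is_first_victim() returns 1: the large buffer was
added last, yet it sits at the front of the list, ahead of all three
small buffers.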

Some other observations: When the last BO of a given size is removed
from the LRU list, the LRU tail for that size is left "floating" in the
middle of the LRU list. So the next BO of that size that is added will
be added at an arbitrary position in the list. It may even end up in the
middle of a block of BOs of a different size. So a log2 grouping may
end up being split.
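
Here is a sketch of how such a split can come about, again in a
simplified model rather than the real code. The names, NUM_GROUPS and
the rule "on removal, a group tail that pointed at the removed BO moves
to its predecessor" are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_GROUPS 16   /* hypothetical number of log2 size groups */

struct bo {
    struct bo *prev, *next;
};

struct lru {
    struct bo head;                 /* sentinel; head.next is evicted first */
    struct bo *tail[NUM_GROUPS];    /* last BO of each size group */
};

void lru_init(struct lru *l)
{
    l->head.prev = l->head.next = &l->head;
    for (int i = 0; i < NUM_GROUPS; i++)
        l->tail[i] = &l->head;
}

/* Link bo right after its group's tail and make it the new tail. */
void lru_add(struct lru *l, struct bo *bo, int group)
{
    struct bo *pos = l->tail[group];

    bo->prev = pos;
    bo->next = pos->next;
    pos->next->prev = bo;
    pos->next = bo;
    l->tail[group] = bo;
}

/* Unlink bo; any group tail that pointed at it moves to its predecessor. */
void lru_del(struct lru *l, struct bo *bo)
{
    for (int i = 0; i < NUM_GROUPS; i++)
        if (l->tail[i] == bo)
            l->tail[i] = bo->prev;
    bo->prev->next = bo->next;
    bo->next->prev = bo->prev;
}

/* Returns 1 if a group-3 BO ends up between two group-1 BOs. */
int group_gets_split(void)
{
    struct lru l;
    struct bo a, a2, b1, b2, b3;

    lru_init(&l);
    lru_add(&l, &a, 3);     /* list: a                                  */
    lru_add(&l, &b1, 1);    /* list: b1 a                               */
    lru_add(&l, &b2, 1);    /* list: b1 b2 a                            */
    lru_del(&l, &a);        /* list: b1 b2; group 3 tail floats to b2   */
    lru_add(&l, &b3, 1);    /* list: b1 b2 b3                           */
    lru_add(&l, &a2, 3);    /* list: b1 b2 a2 b3                        */

    return b2.next == &a2 && a2.next == &b3;
}
```

In this model group_gets_split() returns 1: the tail of the emptied
group stayed on b2, so the next BO of that group is inserted between b2
and b3 and splits their group.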

Regards,
  Felix
