Reverted another change to fix buffer move hangs (was Re: [PATCH] drm/ttm: partial revert "cleanup ttm_tt_(unbind|destroy)" v2)

Deucher, Alexander Alexander.Deucher at amd.com
Thu Aug 11 20:31:31 UTC 2016


> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf
> Of Felix Kuehling
> Sent: Thursday, August 11, 2016 3:52 PM
> To: Michel Dänzer; Christian König
> Cc: amd-gfx at lists.freedesktop.org
> Subject: Reverted another change to fix buffer move hangs (was Re:
> [PATCH] drm/ttm: partial revert "cleanup ttm_tt_(unbind|destroy)" v2)
> 
> We had to revert another change on the KFD branch to fix a buffer move
> problem: 8b6b79f43801f00ddcdc10a4d5719eba4b2e32aa (drm/amdgpu: group BOs
> by log2 of the size on the LRU v2).

That makes sense.  I think you may want a different LRU scheme for KFD or at least special handling for KFD buffers.

Alex

> 
> We haven't looked into this change in detail yet, to understand the
> cause. Kent found it by bisecting on amd-staging-4.6 and applying KFD
> changes on top.
> 
> Regards,
>   Felix
> 
> On 16-08-05 11:06 AM, Felix Kuehling wrote:
> > For the record, Michel's patch "drm/ttm: Wait for a BO to become idle
> > before unbinding it from GTT" fixes our KFD problem as well.
> >
> > Thanks,
> >   Felix
> >
> > On 16-07-27 05:27 PM, Felix Kuehling wrote:
> >> We're also looking into a hang with a KFD unit test that allocates lots
> >> of memory and fragments it deliberately, without mapping it all at once.
> >> It's a new problem for us as we're rebasing on amd-staging-4.6.
> >> Something weird seems to be happening with evictions, but I haven't been
> >> able to figure it out.
> >>
> >> I was able to see that SDMA page table updates stop working at some
> >> point, though SDMA fences are still signaling. If I let the test run
> >> longer, SDMA and CP hang. I dumped the SDMA IBs and didn't see anything
> >> suspicious. My guess was that maybe the SDMA IBs or the ring are getting
> >> corrupted, or maybe the GART table entries for the IBs or ring are
> >> corrupted. But I haven't been able to prove that or track it down to a
> >> root cause. We're now trying to reimplement the test using libdrm-amdgpu
> >> APIs so we can bisect on the amd-staging-4.6 branch without KFD.
> >>
> >> Regards,
> >>   Felix
> >>
> >> On 16-07-26 10:26 PM, Michel Dänzer wrote:
> >>> On 22.07.2016 22:10, Christian König wrote:
> >>>> From: Christian König <christian.koenig at amd.com>
> >>>>
> >>>> We still need to unbind explicitly during a move.
> >>> This change fixed a hang for me when running the piglit test
> >>> max-texture-size with the radeon driver on Kaveri.
> >>>
> >>> However, there's still a similar hang left when letting the piglit test
> >>> tex3d-maxsize run concurrently with other tests (running tex3d-maxsize
> >>> alone doesn't hang, but fails due to running out of GPU memory; that's a
> >>> recent radeonsi regression). There are
> >>>
> >>>  [TTM] Buffer eviction failed
> >>>
> >>> messages in dmesg shortly before the hang.
> >>>
> >>> I haven't seen such hangs with older kernels. Any ideas offhand what the
> >>> problem could be? If not, I'll try bisecting.
> >>>
> >>>
> 
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

