[PATCH] Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole"
Marek Olšák
maraeo at gmail.com
Thu Jan 4 18:29:30 UTC 2024
Hi,
I have received information that the original commit makes all 32-bit
userspace VA allocations fail, so UMDs like Mesa can't even initialize
and they either crash or fail to load. If TBA/TMA was relocated to the
32-bit address range, it would explain why UMDs can't allocate
anything in that range.
Marek
On Wed, Jan 3, 2024 at 2:50 PM Jay Cornwall <jay.cornwall at amd.com> wrote:
>
> On 1/3/2024 12:58, Felix Kuehling wrote:
>
> > A segfault in Mesa seems to be a different issue from what's mentioned
> > in the commit message. I'd let Christian or Marek comment on
> > compatibility with graphics UMDs. I'm not sure why this patch would
> > affect them at all.
>
> I was referencing this issue in OpenCL/OpenGL interop, which certainly looked related:
>
> [ 91.769002] amdgpu 0000:0a:00.0: amdgpu: bo 000000009bba4692 va 0x0800000000-0x08000001ff conflict with 0x0800000000-0x0800000002
> [ 91.769141] ocltst[2781]: segfault at b2 ip 00007f3fb90a7c39 sp 00007ffd3c011ba0 error 4 in radeonsi_dri.so[7f3fb888e000+1196000] likely on CPU 15 (core 7, socket 0)
>
> >
> > Looking at the logs in the tickets, it looks like a fence reference
> > counting error. I don't see how Jay's patch could have caused that. I
> > made another change in that code recently that could make a difference
> > for this issue:
> >
> > commit 8f08c5b24ced1be7eb49692e4816c1916233c79b
> > Author: Felix Kuehling <Felix.Kuehling at amd.com>
> > Date: Fri Oct 27 18:21:55 2023 -0400
> >
> > drm/amdkfd: Run restore_workers on freezable WQs
> >
> > Make restore workers freezable so we don't have to explicitly
> > flush them
> > in suspend and GPU reset code paths, and we don't accidentally
> > try to
> > restore BOs while the GPU is suspended. Not having to flush
> > restore_work
> > also helps avoid lock/fence dependencies in the GPU reset case
> > where we're
> > not allowed to wait for fences.
> >
> > A side effect of this is, that we can now have multiple
> > concurrent threads
> > trying to signal the same eviction fence. Rework eviction fence
> > signaling
> > and replacement to account for that.
> >
> > The GPU reset path can no longer rely on restore_process_worker
> > to resume
> > queues because evict/restore workers can run independently of
> > it. Instead
> > call a new restore_process_helper directly.
> >
> > This is an RFC and request for testing.
> >
> > v2:
> > - Reworked eviction fence signaling
> > - Introduced restore_process_helper
> >
> > v3:
> > - Handle unsignaled eviction fences in restore_process_bos
> >
> > Signed-off-by: Felix Kuehling <Felix.Kuehling at amd.com>
> > Acked-by: Christian König <christian.koenig at amd.com>
> > Tested-by: Emily Deng <Emily.Deng at amd.com>
> > Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> >
> >
> > FWIW, I built a plain 6.6 kernel, and was not able to reproduce the
> > crash with some simple tests.
> >
> > Regards,
> > Felix
> >
> >
> >>
> >> So I agree, let's revert it.
> >>
> >> Reviewed-by: Jay Cornwall <jay.cornwall at amd.com>
>
More information about the amd-gfx
mailing list