[PATCH] Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole"
Jay Cornwall
jay.cornwall at amd.com
Wed Jan 3 19:13:32 UTC 2024
On 1/3/2024 12:58, Felix Kuehling wrote:
> A segfault in Mesa seems to be a different issue from what's mentioned
> in the commit message. I'd let Christian or Marek comment on
> compatibility with graphics UMDs. I'm not sure why this patch would
> affect them at all.
I was referencing this issue in OpenCL/OpenGL interop, which certainly looked related:
[ 91.769002] amdgpu 0000:0a:00.0: amdgpu: bo 000000009bba4692 va 0x0800000000-0x08000001ff conflict with 0x0800000000-0x0800000002
[ 91.769141] ocltst[2781]: segfault at b2 ip 00007f3fb90a7c39 sp 00007ffd3c011ba0 error 4 in radeonsi_dri.so[7f3fb888e000+1196000] likely on CPU 15 (core 7, socket 0)
>
> Looking at the logs in the tickets, it looks like a fence reference
> counting error. I don't see how Jay's patch could have caused that. I
> made another change in that code recently that could make a difference
> for this issue:
>
> commit 8f08c5b24ced1be7eb49692e4816c1916233c79b
> Author: Felix Kuehling <Felix.Kuehling at amd.com>
> Date: Fri Oct 27 18:21:55 2023 -0400
>
> drm/amdkfd: Run restore_workers on freezable WQs
>
> Make restore workers freezable so we don't have to explicitly flush them
> in suspend and GPU reset code paths, and we don't accidentally try to
> restore BOs while the GPU is suspended. Not having to flush restore_work
> also helps avoid lock/fence dependencies in the GPU reset case where we're
> not allowed to wait for fences.
>
> A side effect of this is, that we can now have multiple concurrent threads
> trying to signal the same eviction fence. Rework eviction fence signaling
> and replacement to account for that.
>
> The GPU reset path can no longer rely on restore_process_worker to resume
> queues because evict/restore workers can run independently of it. Instead
> call a new restore_process_helper directly.
>
> This is an RFC and request for testing.
>
> v2:
> - Reworked eviction fence signaling
> - Introduced restore_process_helper
>
> v3:
> - Handle unsignaled eviction fences in restore_process_bos
>
> Signed-off-by: Felix Kuehling <Felix.Kuehling at amd.com>
> Acked-by: Christian König <christian.koenig at amd.com>
> Tested-by: Emily Deng <Emily.Deng at amd.com>
> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>
>
> FWIW, I built a plain 6.6 kernel, and was not able to reproduce the
> crash with some simple tests.
>
> Regards,
> Felix
>
>
>>
>> So I agree, let's revert it.
>>
>> Reviewed-by: Jay Cornwall <jay.cornwall at amd.com>