[PATCH] Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole"

Jay Cornwall jay.cornwall at amd.com
Wed Jan 3 19:13:32 UTC 2024


On 1/3/2024 12:58, Felix Kuehling wrote:

> A segfault in Mesa seems to be a different issue from what's mentioned 
> in the commit message. I'd let Christian or Marek comment on 
> compatibility with graphics UMDs. I'm not sure why this patch would 
> affect them at all.

I was referencing this issue in OpenCL/OpenGL interop, which certainly looked related:

[   91.769002] amdgpu 0000:0a:00.0: amdgpu: bo 000000009bba4692 va 0x0800000000-0x08000001ff conflict with 0x0800000000-0x0800000002
[   91.769141] ocltst[2781]: segfault at b2 ip 00007f3fb90a7c39 sp 00007ffd3c011ba0 error 4 in radeonsi_dri.so[7f3fb888e000+1196000] likely on CPU 15 (core 7, socket 0)

> 
> Looking at the logs in the tickets, it looks like a fence reference 
> counting error. I don't see how Jay's patch could have caused that. I 
> made another change in that code recently that could make a difference 
> for this issue:
> 
>     commit 8f08c5b24ced1be7eb49692e4816c1916233c79b
>     Author: Felix Kuehling <Felix.Kuehling at amd.com>
>     Date:   Fri Oct 27 18:21:55 2023 -0400
> 
>          drm/amdkfd: Run restore_workers on freezable WQs
> 
>          Make restore workers freezable so we don't have to explicitly
>     flush them
>          in suspend and GPU reset code paths, and we don't accidentally
>     try to
>          restore BOs while the GPU is suspended. Not having to flush
>     restore_work
>          also helps avoid lock/fence dependencies in the GPU reset case
>     where we're
>          not allowed to wait for fences.
> 
>          A side effect of this is, that we can now have multiple
>     concurrent threads
>          trying to signal the same eviction fence. Rework eviction fence
>     signaling
>          and replacement to account for that.
> 
>          The GPU reset path can no longer rely on restore_process_worker
>     to resume
>          queues because evict/restore workers can run independently of
>     it. Instead
>          call a new restore_process_helper directly.
> 
>          This is an RFC and request for testing.
> 
>          v2:
>          - Reworked eviction fence signaling
>          - Introduced restore_process_helper
> 
>          v3:
>          - Handle unsignaled eviction fences in restore_process_bos
> 
>          Signed-off-by: Felix Kuehling <Felix.Kuehling at amd.com>
>          Acked-by: Christian König <christian.koenig at amd.com>
>          Tested-by: Emily Deng <Emily.Deng at amd.com>
>          Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> 
> 
> FWIW, I built a plain 6.6 kernel, and was not able to reproduce the 
> crash with some simple tests.
> 
> Regards,
>    Felix
> 
> 
>>
>> So I agree, let's revert it.
>>
>> Reviewed-by: Jay Cornwall <jay.cornwall at amd.com>



More information about the amd-gfx mailing list