[PATCH] Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole"
Felix Kuehling
felix.kuehling at amd.com
Fri Jan 5 20:20:34 UTC 2024
TBA/TMA were relocated to the upper half of the canonical address space.
I don't think that qualifies as 32-bit by definition. But maybe you're
using a different definition.
That said, if Mesa manages its own virtual address space in user mode,
and KFD maps the TMA/TBA at an address that Mesa believes to be free, I
can see how that would lead to problems.
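
To illustrate the kind of clash I mean, here is a minimal sketch in plain C; the struct and helper are mine for illustration, not Mesa's or KFD's code. A user-mode allocator that doesn't know about a kernel-side reservation will eventually hand out an overlapping range:

/* Illustration only, not Mesa or KFD code: two VA ranges conflict when
 * they overlap. If KFD reserves the TBA/TMA in a range that Mesa's
 * user-mode allocator still considers free, Mesa's next allocation from
 * that range overlaps the reservation. */
#include <stdbool.h>
#include <stdint.h>

struct va_range {
        uint64_t start; /* first address in the range, inclusive */
        uint64_t end;   /* last address in the range, inclusive */
};

static bool va_ranges_conflict(struct va_range a, struct va_range b)
{
        return a.start <= b.end && b.start <= a.end;
}

Plugging in the two ranges from the log Jay quoted below (0x0800000000-0x08000001ff and 0x0800000000-0x0800000002) makes this check return true, which matches the "conflict" message.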
Either way, the fence refcount bug is a separate problem, one that may have
been exposed by the way a crashing Mesa application shuts down.
Reverting Jay's patch certainly didn't fix that; it only hides the problem.
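
For completeness, this is the refcounting rule at play; a hedged sketch of the generic dma-fence pattern, not the actual amdkfd code (the helper names are made up):

/* Hedged sketch of the generic dma-fence refcount pattern, not the
 * amdkfd implementation. A skipped dma_fence_get(), or an extra
 * dma_fence_put() on a teardown path (for example when a crashing
 * application exits), frees the fence while another thread still
 * signals or waits on it. */
#include <linux/dma-fence.h>

static struct dma_fence *cache_eviction_fence(struct dma_fence *fence)
{
        /* Take a reference before publishing the pointer anywhere else. */
        return dma_fence_get(fence);
}

static void release_eviction_fence(struct dma_fence *fence)
{
        /* Balance every dma_fence_get() with exactly one dma_fence_put(). */
        dma_fence_put(fence);
}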
Regards,
Felix
On 2024-01-04 13:29, Marek Olšák wrote:
> Hi,
>
> I have received information that the original commit makes all 32-bit
> userspace VA allocations fail, so UMDs like Mesa can't even initialize
> and they either crash or fail to load. If TBA/TMA was relocated to the
> 32-bit address range, it would explain why UMDs can't allocate
> anything in that range.
>
> Marek
>
> On Wed, Jan 3, 2024 at 2:50 PM Jay Cornwall <jay.cornwall at amd.com> wrote:
>> On 1/3/2024 12:58, Felix Kuehling wrote:
>>
>>> A segfault in Mesa seems to be a different issue from what's mentioned
>>> in the commit message. I'd let Christian or Marek comment on
>>> compatibility with graphics UMDs. I'm not sure why this patch would
>>> affect them at all.
>> I was referencing this issue in OpenCL/OpenGL interop, which certainly looked related:
>>
>> [ 91.769002] amdgpu 0000:0a:00.0: amdgpu: bo 000000009bba4692 va 0x0800000000-0x08000001ff conflict with 0x0800000000-0x0800000002
>> [ 91.769141] ocltst[2781]: segfault at b2 ip 00007f3fb90a7c39 sp 00007ffd3c011ba0 error 4 in radeonsi_dri.so[7f3fb888e000+1196000] likely on CPU 15 (core 7, socket 0)
>>
>>> Looking at the logs in the tickets, it looks like a fence reference
>>> counting error. I don't see how Jay's patch could have caused that. I
>>> made another change in that code recently that could make a difference
>>> for this issue:
>>>
>>> commit 8f08c5b24ced1be7eb49692e4816c1916233c79b
>>> Author: Felix Kuehling <Felix.Kuehling at amd.com>
>>> Date: Fri Oct 27 18:21:55 2023 -0400
>>>
>>> drm/amdkfd: Run restore_workers on freezable WQs
>>>
>>> Make restore workers freezable so we don't have to explicitly flush
>>> them in suspend and GPU reset code paths, and we don't accidentally
>>> try to restore BOs while the GPU is suspended. Not having to flush
>>> restore_work also helps avoid lock/fence dependencies in the GPU
>>> reset case where we're not allowed to wait for fences.
>>>
>>> A side effect of this is that we can now have multiple concurrent
>>> threads trying to signal the same eviction fence. Rework eviction
>>> fence signaling and replacement to account for that.
>>>
>>> The GPU reset path can no longer rely on restore_process_worker to
>>> resume queues because evict/restore workers can run independently of
>>> it. Instead call a new restore_process_helper directly.
>>>
>>> This is an RFC and request for testing.
>>>
>>> v2:
>>> - Reworked eviction fence signaling
>>> - Introduced restore_process_helper
>>>
>>> v3:
>>> - Handle unsignaled eviction fences in restore_process_bos
>>>
>>> Signed-off-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>> Acked-by: Christian König <christian.koenig at amd.com>
>>> Tested-by: Emily Deng <Emily.Deng at amd.com>
>>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>>>
>>>
>>> FWIW, I built a plain 6.6 kernel, and was not able to reproduce the
>>> crash with some simple tests.
>>>
>>> Regards,
>>> Felix
>>>
>>>
>>>> So I agree, let's revert it.
>>>>
>>>> Reviewed-by: Jay Cornwall <jay.cornwall at amd.com>
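
For anyone testing the restore_workers commit quoted above, the core of the freezable-workqueue technique it describes is the flag passed at allocation time; a minimal sketch follows (queue name and helpers are illustrative, not the exact amdkfd code):

/* Minimal sketch of running work on a freezable workqueue. A freezable
 * queue is frozen together with user space during suspend, so work queued
 * on it cannot run while the GPU is suspended and does not need an
 * explicit flush in the suspend/reset paths. Names are illustrative. */
#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *restore_wq;

static int restore_wq_init(void)
{
        restore_wq = alloc_workqueue("restore_wq", WQ_FREEZABLE, 0);
        return restore_wq ? 0 : -ENOMEM;
}

static void restore_wq_queue(struct work_struct *work)
{
        /* Queue restore work on the freezable queue instead of system_wq. */
        queue_work(restore_wq, work);
}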