[PATCH] Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole"
Felix Kuehling
felix.kuehling at amd.com
Fri Jan 5 20:20:34 UTC 2024
TBA/TMA were relocated to the upper half of the canonical address space.
I don't think that qualifies as 32-bit by definition. But maybe you're
using a different definition.
That said, if Mesa manages its own virtual address space in user mode,
and KFD maps the TMA/TBA at an address that Mesa believes to be free, I
can see how that would lead to problems.
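
To illustrate the kind of clash I mean, here is a minimal sketch in plain C; the struct and helper are mine for illustration, not Mesa's or KFD's code. A user-mode allocator that doesn't know about a kernel-side reservation will eventually hand out an overlapping range:

/* Illustration only, not Mesa or KFD code: two VA ranges conflict when
 * they overlap. If KFD reserves the TBA/TMA in a range that Mesa's
 * user-mode allocator still considers free, Mesa's next allocation from
 * that range overlaps the reservation. */
#include <stdbool.h>
#include <stdint.h>

struct va_range {
        uint64_t start; /* first address in the range, inclusive */
        uint64_t end;   /* last address in the range, inclusive */
};

static bool va_ranges_conflict(struct va_range a, struct va_range b)
{
        return a.start <= b.end && b.start <= a.end;
}

Plugging in the two ranges from the log Jay quoted below (0x0800000000-0x08000001ff and 0x0800000000-0x0800000002) makes this check return true, which matches the "conflict" message.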
Either way, the fence refcount bug is a separate problem, one that may have
been exposed by the way a crashing Mesa application shuts down.
Reverting Jay's patch certainly didn't fix that; it only hides the problem.
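
For completeness, this is the refcounting rule at play; a hedged sketch of the generic dma-fence pattern, not the actual amdkfd code (the helper names are made up):

/* Hedged sketch of the generic dma-fence refcount pattern, not the
 * amdkfd implementation. A skipped dma_fence_get(), or an extra
 * dma_fence_put() on a teardown path (for example when a crashing
 * application exits), frees the fence while another thread still
 * signals or waits on it. */
#include <linux/dma-fence.h>

static struct dma_fence *cache_eviction_fence(struct dma_fence *fence)
{
        /* Take a reference before publishing the pointer anywhere else. */
        return dma_fence_get(fence);
}

static void release_eviction_fence(struct dma_fence *fence)
{
        /* Balance every dma_fence_get() with exactly one dma_fence_put(). */
        dma_fence_put(fence);
}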
Regards,
Felix
On 2024-01-04 13:29, Marek Olšák wrote:
> Hi,
>
> I have received information that the original commit makes all 32-bit
> userspace VA allocations fail, so UMDs like Mesa can't even initialize
> and they either crash or fail to load. If TBA/TMA was relocated to the
> 32-bit address range, it would explain why UMDs can't allocate
> anything in that range.
>
> Marek
>
> On Wed, Jan 3, 2024 at 2:50 PM Jay Cornwall <jay.cornwall at amd.com> wrote:
>> On 1/3/2024 12:58, Felix Kuehling wrote:
>>
>>> A segfault in Mesa seems to be a different issue from what's mentioned
>>> in the commit message. I'd let Christian or Marek comment on
>>> compatibility with graphics UMDs. I'm not sure why this patch would
>>> affect them at all.
>> I was referencing this issue in OpenCL/OpenGL interop, which certainly looked related:
>>
>> [ 91.769002] amdgpu 0000:0a:00.0: amdgpu: bo 000000009bba4692 va 0x0800000000-0x08000001ff conflict with 0x0800000000-0x0800000002
>> [ 91.769141] ocltst[2781]: segfault at b2 ip 00007f3fb90a7c39 sp 00007ffd3c011ba0 error 4 in radeonsi_dri.so[7f3fb888e000+1196000] likely on CPU 15 (core 7, socket 0)
>>
>>> Looking at the logs in the tickets, it looks like a fence reference
>>> counting error. I don't see how Jay's patch could have caused that. I
>>> made another change in that code recently that could make a difference
>>> for this issue:
>>>
>>> commit 8f08c5b24ced1be7eb49692e4816c1916233c79b
>>> Author: Felix Kuehling <Felix.Kuehling at amd.com>
>>> Date: Fri Oct 27 18:21:55 2023 -0400
>>>
>>> drm/amdkfd: Run restore_workers on freezable WQs
>>>
>>> Make restore workers freezable so we don't have to explicitly flush
>>> them in suspend and GPU reset code paths, and we don't accidentally
>>> try to restore BOs while the GPU is suspended. Not having to flush
>>> restore_work also helps avoid lock/fence dependencies in the GPU
>>> reset case where we're not allowed to wait for fences.
>>>
>>> A side effect of this is that we can now have multiple concurrent
>>> threads trying to signal the same eviction fence. Rework eviction
>>> fence signaling and replacement to account for that.
>>>
>>> The GPU reset path can no longer rely on restore_process_worker to
>>> resume queues because evict/restore workers can run independently of
>>> it. Instead call a new restore_process_helper directly.
>>>
>>> This is an RFC and request for testing.
>>>
>>> v2:
>>> - Reworked eviction fence signaling
>>> - Introduced restore_process_helper
>>>
>>> v3:
>>> - Handle unsignaled eviction fences in restore_process_bos
>>>
>>> Signed-off-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>> Acked-by: Christian König <christian.koenig at amd.com>
>>> Tested-by: Emily Deng <Emily.Deng at amd.com>
>>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>>>
>>>
>>> FWIW, I built a plain 6.6 kernel, and was not able to reproduce the
>>> crash with some simple tests.
>>>
>>> Regards,
>>> Felix
>>>
>>>
>>>> So I agree, let's revert it.
>>>>
>>>> Reviewed-by: Jay Cornwall <jay.cornwall at amd.com>
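
For anyone testing the restore_workers commit quoted above, the core of the freezable-workqueue technique it describes is the flag passed at allocation time; a minimal sketch follows (queue name and helpers are illustrative, not the exact amdkfd code):

/* Minimal sketch of running work on a freezable workqueue. A freezable
 * queue is frozen together with user space during suspend, so work queued
 * on it cannot run while the GPU is suspended and does not need an
 * explicit flush in the suspend/reset paths. Names are illustrative. */
#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *restore_wq;

static int restore_wq_init(void)
{
        restore_wq = alloc_workqueue("restore_wq", WQ_FREEZABLE, 0);
        return restore_wq ? 0 : -ENOMEM;
}

static void restore_wq_queue(struct work_struct *work)
{
        /* Queue restore work on the freezable queue instead of system_wq. */
        queue_work(restore_wq, work);
}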