drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole (v2)
Shengyu Qu
wiagn233 at outlook.com
Tue Jan 30 01:22:43 UTC 2024
Hi Felix,
Thanks for the reply. I'll record a backtrace when I'm free. In the meantime,
here is a dmesg log from someone else in the issue discussion about this problem:
https://projects.blender.org/attachments/ea7b7db5-ac16-479d-935b-9e1da33cd6f0
I tested with next-20240129 with this patch applied; my setup is Plasma 6.0
RC1 (Wayland) + RX 6600 XT.
Best regards,
Shengyu
On 2024/1/30 1:47, Felix Kuehling wrote:
> On 2024-01-29 10:24, Shengyu Qu wrote:
>> Hello Felix,
>> I think you are right. This problem has existed for years (just look at
>> the issue creation time in my link), and is thought to be caused by
>> OpenGL-ROCm interop (that's why I thought this patch might help). It is
>> very easy to trigger this problem in Blender (the method is also mentioned
>> in the link).
>
> This doesn't help you, but it's unlikely that this has been the same
> issue for two years for everybody who chimed into this bug report.
> Different kernel versions, GPUs, user mode ROCm and Mesa versions etc.
>
> Case in point, it's possible that you're seeing an issue specific to
> RDNA3, which hasn't even been around for that long.
>
>
>> Do you have any idea about this?
>
> Not without seeing a lot more diagnostic information. A full backtrace
> from your kernel log would be a good start.
>
> Regards,
> Felix
>
>
>> Best regards,
>> Shengyu
>> On 2024/1/29 22:51, Felix Kuehling wrote:
>>> On 2024-01-29 8:58, Shengyu Qu wrote:
>>>> Hi,
>>>> It seems the ROCm-OpenGL interop hang problem still exists [1]. By the
>>>> way, have you looked into this problem?
>>>> Best regards,
>>>> Shengyu
>>>> [1]
>>>> https://projects.blender.org/blender/blender/issues/100353#issuecomment-1111599
>>>
>>> Maybe you're having a different problem. Do you see this issue also
>>> without any version of the "Relocate TBA/TMA ..." patch?
>>>
>>> Regards,
>>> Felix
>>>
>>>
>>>>
>>>> On 2024/1/27 03:15, Shengyu Qu wrote:
>>>>> Hello Felix,
>>>>> This patch seems to work on my system, and it also seems to fix the
>>>>> ROCm/OpenGL interop problem.
>>>>> Is this intended or not? Maybe we need more users to test it.
>>>>> Besides,
>>>>> Tested-by: Shengyu Qu <wiagn233 at outlook.com>
>>>>> Best Regards,
>>>>> Shengyu
>>>>>
>>>>> On 2024/1/26 06:27, Felix Kuehling wrote:
>>>>>> The TBA and TMA, along with an unused IB allocation, reside at low
>>>>>> addresses in the VM address space. A stray VM fault which hits these
>>>>>> pages must be serviced by making their page table entries invalid.
>>>>>> The scheduler depends upon these pages being resident and fails,
>>>>>> preventing a debugger from inspecting the failure state.
>>>>>>
>>>>>> By relocating these pages above 47 bits in the VM address space they
>>>>>> can only be reached when bits [63:48] are set to 1. This makes it much
>>>>>> less likely for a misbehaving program to generate accesses to them.
>>>>>> The current placement at VA (PAGE_SIZE*2) is readily hit by a NULL
>>>>>> access with a small offset.
>>>>>>
>>>>>> v2:
>>>>>> - Move it to the reserved space to avoid conflicts with Mesa
>>>>>> - Add macros to make reserved space management easier
>>>>>>
>>>>>> Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam at amd.com>
>>>>>> Cc: Christian Koenig <christian.koenig at amd.com>
>>>>>> Signed-off-by: Jay Cornwall <jay.cornwall at amd.com>
>>>>>> Signed-off-by: Felix Kuehling <felix.kuehling at amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 4 +--
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c | 7 ++---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 12 ++++++--
>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c | 30 +++++++++++---------
>>>>>> 4 files changed, 30 insertions(+), 23 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>>>>>> index 823d31f4a2a3..53d0a458d78e 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>>>>>> @@ -28,9 +28,9 @@
>>>>>> uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)
>>>>>> {
>>>>>> - uint64_t addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT;
>>>>>> + uint64_t addr = AMDGPU_VA_RESERVED_CSA_START(
>>>>>> + adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT);
>>>>>> - addr -= AMDGPU_VA_RESERVED_CSA_SIZE;
>>>>>> addr = amdgpu_gmc_sign_extend(addr);
>>>>>> return addr;
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
>>>>>> index 3d0d56087d41..9e769ef50f2e 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
>>>>>> @@ -45,11 +45,8 @@
>>>>>> */
>>>>>> static inline u64 amdgpu_seq64_get_va_base(struct amdgpu_device *adev)
>>>>>> {
>>>>>> - u64 addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT;
>>>>>> -
>>>>>> - addr -= AMDGPU_VA_RESERVED_TOP;
>>>>>> -
>>>>>> - return addr;
>>>>>> + return AMDGPU_VA_RESERVED_SEQ64_START(
>>>>>> + adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT);
>>>>>> }
>>>>>> /**
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>> index 666698a57192..f23b6153d310 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>> @@ -135,11 +135,19 @@ struct amdgpu_mem_stats;
>>>>>> #define AMDGPU_IS_MMHUB0(x) ((x) >= AMDGPU_MMHUB0_START && (x) < AMDGPU_MMHUB1_START)
>>>>>> #define AMDGPU_IS_MMHUB1(x) ((x) >= AMDGPU_MMHUB1_START && (x) < AMDGPU_MAX_VMHUBS)
>>>>>> -/* Reserve 2MB at top/bottom of address space for kernel use */
>>>>>> +/* Reserve space at top/bottom of address space for kernel use */
>>>>>> #define AMDGPU_VA_RESERVED_CSA_SIZE (2ULL << 20)
>>>>>> +#define AMDGPU_VA_RESERVED_CSA_START(top) ((top) \
>>>>>> + - AMDGPU_VA_RESERVED_CSA_SIZE)
>>>>>> #define AMDGPU_VA_RESERVED_SEQ64_SIZE (2ULL << 20)
>>>>>> +#define AMDGPU_VA_RESERVED_SEQ64_START(top) (AMDGPU_VA_RESERVED_CSA_START(top) \
>>>>>> + - AMDGPU_VA_RESERVED_SEQ64_SIZE)
>>>>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE (2ULL << 12)
>>>>>> +#define AMDGPU_VA_RESERVED_TRAP_START(top) (AMDGPU_VA_RESERVED_SEQ64_START(top) \
>>>>>> + - AMDGPU_VA_RESERVED_TRAP_SIZE)
>>>>>> #define AMDGPU_VA_RESERVED_BOTTOM (2ULL << 20)
>>>>>> -#define AMDGPU_VA_RESERVED_TOP (AMDGPU_VA_RESERVED_SEQ64_SIZE + \
>>>>>> +#define AMDGPU_VA_RESERVED_TOP (AMDGPU_VA_RESERVED_TRAP_SIZE + \
>>>>>> + AMDGPU_VA_RESERVED_SEQ64_SIZE + \
>>>>>> AMDGPU_VA_RESERVED_CSA_SIZE)
>>>>>> /* See vm_update_mode */
>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c b/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
>>>>>> index 6604a3f99c5e..f899cce25b2a 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
>>>>>> @@ -36,6 +36,7 @@
>>>>>> #include <linux/mm.h>
>>>>>> #include <linux/mman.h>
>>>>>> #include <linux/processor.h>
>>>>>> +#include "amdgpu_vm.h"
>>>>>> /*
>>>>>> * The primary memory I/O features being added for revisions of gfxip
>>>>>> @@ -326,10 +327,16 @@ static void kfd_init_apertures_vi(struct kfd_process_device *pdd, uint8_t id)
>>>>>> * with small reserved space for kernel.
>>>>>> * Set them to CANONICAL addresses.
>>>>>> */
>>>>>> - pdd->gpuvm_base = SVM_USER_BASE;
>>>>>> + pdd->gpuvm_base = max(SVM_USER_BASE, AMDGPU_VA_RESERVED_BOTTOM);
>>>>>> pdd->gpuvm_limit =
>>>>>> pdd->dev->kfd->shared_resources.gpuvm_size - 1;
>>>>>> + /* dGPUs: the reserved space for kernel
>>>>>> + * before SVM
>>>>>> + */
>>>>>> + pdd->qpd.cwsr_base = SVM_CWSR_BASE;
>>>>>> + pdd->qpd.ib_base = SVM_IB_BASE;
>>>>>> +
>>>>>> pdd->scratch_base = MAKE_SCRATCH_APP_BASE_VI();
>>>>>> pdd->scratch_limit = MAKE_SCRATCH_APP_LIMIT(pdd->scratch_base);
>>>>>> }
>>>>>> @@ -339,18 +346,19 @@ static void kfd_init_apertures_v9(struct kfd_process_device *pdd, uint8_t id)
>>>>>> pdd->lds_base = MAKE_LDS_APP_BASE_V9();
>>>>>> pdd->lds_limit = MAKE_LDS_APP_LIMIT(pdd->lds_base);
>>>>>> - /* Raven needs SVM to support graphic handle, etc. Leave the small
>>>>>> - * reserved space before SVM on Raven as well, even though we don't
>>>>>> - * have to.
>>>>>> - * Set gpuvm_base and gpuvm_limit to CANONICAL addresses so that they
>>>>>> - * are used in Thunk to reserve SVM.
>>>>>> - */
>>>>>> - pdd->gpuvm_base = SVM_USER_BASE;
>>>>>> + pdd->gpuvm_base = AMDGPU_VA_RESERVED_BOTTOM;
>>>>>> pdd->gpuvm_limit =
>>>>>> pdd->dev->kfd->shared_resources.gpuvm_size - 1;
>>>>>> pdd->scratch_base = MAKE_SCRATCH_APP_BASE_V9();
>>>>>> pdd->scratch_limit = MAKE_SCRATCH_APP_LIMIT(pdd->scratch_base);
>>>>>> +
>>>>>> + /*
>>>>>> + * Place TBA/TMA on opposite side of VM hole to prevent
>>>>>> + * stray faults from triggering SVM on these pages.
>>>>>> + */
>>>>>> + pdd->qpd.cwsr_base = AMDGPU_VA_RESERVED_TRAP_START(
>>>>>> + pdd->dev->adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT);
>>>>>> }
>>>>>> int kfd_init_apertures(struct kfd_process *process)
>>>>>> @@ -407,12 +415,6 @@ int kfd_init_apertures(struct kfd_process *process)
>>>>>> return -EINVAL;
>>>>>> }
>>>>>> }
>>>>>> -
>>>>>> - /* dGPUs: the reserved space for kernel
>>>>>> - * before SVM
>>>>>> - */
>>>>>> - pdd->qpd.cwsr_base = SVM_CWSR_BASE;
>>>>>> - pdd->qpd.ib_base = SVM_IB_BASE;
>>>>>> }
>>>>>> dev_dbg(kfd_device, "node id %u\n", id);
>>>>>
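
For anyone following the address arithmetic in the quoted patch, here is a minimal user-space sketch of how the reserved regions stack below the top of the VM address space. The macros are copied from the amdgpu_vm.h hunk above. Everything prefixed with example_ or EXAMPLE_ is hypothetical and only for illustration: the 48-bit address space is an assumption taken from the "bits [63:48]" wording in the commit message, and example_sign_extend() is a stand-in for the idea behind amdgpu_gmc_sign_extend(), not the kernel's implementation.

```c
#include <stdint.h>

/* Sizes and start macros as they appear in the patch's amdgpu_vm.h hunk. */
#define AMDGPU_VA_RESERVED_CSA_SIZE		(2ULL << 20)
#define AMDGPU_VA_RESERVED_CSA_START(top)	((top) \
						 - AMDGPU_VA_RESERVED_CSA_SIZE)
#define AMDGPU_VA_RESERVED_SEQ64_SIZE		(2ULL << 20)
#define AMDGPU_VA_RESERVED_SEQ64_START(top)	(AMDGPU_VA_RESERVED_CSA_START(top) \
						 - AMDGPU_VA_RESERVED_SEQ64_SIZE)
#define AMDGPU_VA_RESERVED_TRAP_SIZE		(2ULL << 12)
#define AMDGPU_VA_RESERVED_TRAP_START(top)	(AMDGPU_VA_RESERVED_SEQ64_START(top) \
						 - AMDGPU_VA_RESERVED_TRAP_SIZE)
#define AMDGPU_VA_RESERVED_TOP			(AMDGPU_VA_RESERVED_TRAP_SIZE + \
						 AMDGPU_VA_RESERVED_SEQ64_SIZE + \
						 AMDGPU_VA_RESERVED_CSA_SIZE)

/* Assumption for this sketch only: a 48-bit GPU virtual address space. */
#define EXAMPLE_VA_BITS 48

/* Hypothetical helper: lowest address of the TBA/TMA (trap) region. */
static uint64_t example_trap_start(uint64_t top)
{
	return AMDGPU_VA_RESERVED_TRAP_START(top);
}

/*
 * Hypothetical helper: turn an address in the upper half of the
 * 48-bit space into canonical form by setting bits [63:48] to 1.
 * Addresses in the lower half are returned unchanged.
 */
static uint64_t example_sign_extend(uint64_t addr)
{
	if (addr & (1ULL << (EXAMPLE_VA_BITS - 1)))
		addr |= ~((1ULL << EXAMPLE_VA_BITS) - 1);
	return addr;
}
```

With top = 1ULL << 48, example_trap_start(top) gives 0xFFFFFFBFE000, which is AMDGPU_VA_RESERVED_TOP (0x402000) bytes below the top, and example_sign_extend() of that value gives 0xFFFFFFFFFFBFE000. A NULL pointer plus a small offset cannot land there, which is the point made in the commit message.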