drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole (v2)

Mon Feb 5 17:12:27 UTC 2024

[Public]

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
> Shengyu Qu
> Sent: Saturday, February 3, 2024 8:05 AM
> To: Kuehling, Felix <Felix.Kuehling at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: wiagn233 at outlook.com; Cornwall, Jay <Jay.Cornwall at amd.com>;
> Koenig, Christian <Christian.Koenig at amd.com>; Paneer Selvam, Arunpravin
> <Arunpravin.PaneerSelvam at amd.com>
> Subject: Re: drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole (v2)
>
> Hi Felix,
> Sorry for my late reply. I was busy this week.
> I just did some more tests using next-20240202 branch. Testing using blender
> 4.0.2, when only one HIP render task is running, there's no problem.
> However, when two tasks run together, software always crashes, but not
> crashes the whole system. Dmesg reports gpu reset in most cases, for
> example:
>
> [  176.071823] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
> gfx_0.0.0 timeout, signaled seq=32608, emitted seq=32610 [  176.072000]
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
> information: process blender pid 4256 thread blender:cs0 pid 4297
> [  176.072143] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
> [  176.073571] amdgpu 0000:03:00.0: amdgpu: Guilty job already signaled,
> skipping HW reset [  176.073593] amdgpu 0000:03:00.0: amdgpu: GPU
> reset(4) succeeded!
>
> And in some rare cases, there would be a page fault report, see dmesg.log.
> Do you have any idea? Can I make it print more detailed diagnostic
> information?

Are you only seeing the problem with this patch applied or in general?  If you are seeing it in general, it likely related to a firmware issue that was recently fixed that will be resolved with an update CP firmware image.
Driver side changes:
https://gitlab.freedesktop.org/agd5f/linux/-/commit/0eb6c664b780dd1b4080e047ad51b100cd7840a3
https://gitlab.freedesktop.org/agd5f/linux/-/commit/40970e60070ed3d1390ec65e38e819f6d81b8f0c

Alex

>
> Best regards,
> Shengyu
>
>
> 在 2024/1/30 01:47, Felix Kuehling 写道:
> > On 2024-01-29 10:24, Shengyu Qu wrote:
> >> Hello Felix,
> >> I think you are right. This problem has existed for years(just look
> >> at the issue creation time in my link), and is thought caused by
> >> OpenGL-ROCM interop(that's why I think this patch might help). It is
> >> very easy to trigger this problem in blender(method is also mentioned
> >> in the link).
> >
> > This doesn't help you, but it's unlikely that this has been the same
> > issue for two years for everybody who chimed into this bug report.
> > Different kernel versions, GPUs, user mode ROCm and Mesa versions etc.
> >
> > Case in point, it's possible that you're seeing an issue specific to
> > RDNA3, which hasn't even been around for that long.
> >
> >
> >> Do
> >> you have any idea about this?
> >
> > Not without seeing a lot more diagnostic information. A full backtrace
> > from your kernel log would be a good start.
> >
> > Regards,
> >   Felix
> >
> >
> >> Best regards,
> >> Shengyu
> >> 在 2024/1/29 22:51, Felix Kuehling 写道:
> >>> On 2024-01-29 8:58, Shengyu Qu wrote:
> >>>> Hi,
> >>>> Seems rocm-opengl interop hang problem still exists[1]. Btw have
> >>>> you discovered into this problem?
> >>>> Best regards,
> >>>> Shengyu
> >>>> [1]
> >>>> https://projects.blender.org/blender/blender/issues/100353#issuecom
> >>>> ment-1111599
> >>>
> >>> Maybe you're having a different problem. Do you see this issue also
> >>> without any version of the "Relocate TBA/TMA ..." patch?
> >>>
> >>> Regards,
> >>>   Felix
> >>>
> >>>
> >>>>
> >>>> 在 2024/1/27 03:15, Shengyu Qu 写道:
> >>>>> Hello Felix,
> >>>>> This patch seems working on my system, also it seems fixes the
> >>>>> ROCM/OpenGL interop problem.
> >>>>> Is this intended to happen or not? Maybe we need more users to
> >>>>> test it.
> >>>>> Besides,
> >>>>> Tested-by: Shengyu Qu <wiagn233 at outlook.com> Best Regards,
> Shengyu
> >>>>>
> >>>>> 在 2024/1/26 06:27, Felix Kuehling 写道:
> >>>>>> The TBA and TMA, along with an unused IB allocation, reside at
> >>>>>> low addresses in the VM address space. A stray VM fault which
> >>>>>> hits these pages must be serviced by making their page table entries
> invalid.
> >>>>>> The scheduler depends upon these pages being resident and fails,
> >>>>>> preventing a debugger from inspecting the failure state.
> >>>>>>
> >>>>>> By relocating these pages above 47 bits in the VM address space
> >>>>>> they can only be reached when bits [63:48] are set to 1. This
> >>>>>> makes it much less likely for a misbehaving program to generate
> >>>>>> accesses to them.
> >>>>>> The current placement at VA (PAGE_SIZE*2) is readily hit by a
> >>>>>> NULL access with a small offset.
> >>>>>>
> >>>>>> v2:
> >>>>>> - Move it to the reserved space to avoid concflicts with Mesa
> >>>>>> - Add macros to make reserved space management easier
> >>>>>>
> >>>>>> Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam at amd.com>
> >>>>>> Cc: Christian Koenig <christian.koenig at amd.com>
> >>>>>> Signed-off-by: Jay Cornwall <jay.cornwall at amd.com>
> >>>>>> Signed-off-by: Felix Kuehling <felix.kuehling at amd.com>
> >>>>>> ---
> >>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c      |  4 +--
> >>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c    |  7 ++---
> >>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h       | 12 ++++++--
> >>>>>>   drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c | 30
> >>>>>> +++++++++++---------
> >>>>>>   4 files changed, 30 insertions(+), 23 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
> >>>>>> index 823d31f4a2a3..53d0a458d78e 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
> >>>>>> @@ -28,9 +28,9 @@
> >>>>>>     uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)
> >>>>>>   {
> >>>>>> -    uint64_t addr = adev->vm_manager.max_pfn <<
> >>>>>> AMDGPU_GPU_PAGE_SHIFT;
> >>>>>> +    uint64_t addr = AMDGPU_VA_RESERVED_CSA_START(
> >>>>>> +        adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT);
> >>>>>>   -    addr -= AMDGPU_VA_RESERVED_CSA_SIZE;
> >>>>>>       addr = amdgpu_gmc_sign_extend(addr);
> >>>>>>         return addr;
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
> >>>>>> index 3d0d56087d41..9e769ef50f2e 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
> >>>>>> @@ -45,11 +45,8 @@
> >>>>>>    */
> >>>>>>   static inline u64 amdgpu_seq64_get_va_base(struct amdgpu_device
> >>>>>> *adev)
> >>>>>>   {
> >>>>>> -    u64 addr = adev->vm_manager.max_pfn <<
> >>>>>> AMDGPU_GPU_PAGE_SHIFT;
> >>>>>> -
> >>>>>> -    addr -= AMDGPU_VA_RESERVED_TOP;
> >>>>>> -
> >>>>>> -    return addr;
> >>>>>> +    return AMDGPU_VA_RESERVED_SEQ64_START(
> >>>>>> +        adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT);
> >>>>>>   }
> >>>>>>     /**
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> >>>>>> index 666698a57192..f23b6153d310 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> >>>>>> @@ -135,11 +135,19 @@ struct amdgpu_mem_stats;
> >>>>>>   #define AMDGPU_IS_MMHUB0(x) ((x) >=
> AMDGPU_MMHUB0_START && (x)
> >>>>>> < AMDGPU_MMHUB1_START)
> >>>>>>   #define AMDGPU_IS_MMHUB1(x) ((x) >=
> AMDGPU_MMHUB1_START && (x)
> >>>>>> < AMDGPU_MAX_VMHUBS)
> >>>>>>   -/* Reserve 2MB at top/bottom of address space for kernel use
> >>>>>> */
> >>>>>> +/* Reserve space at top/bottom of address space for kernel use
> >>>>>> +*/
> >>>>>>   #define AMDGPU_VA_RESERVED_CSA_SIZE        (2ULL << 20)
> >>>>>> +#define AMDGPU_VA_RESERVED_CSA_START(top)    ((top) \
> >>>>>> +                         - AMDGPU_VA_RESERVED_CSA_SIZE)
> >>>>>>   #define AMDGPU_VA_RESERVED_SEQ64_SIZE        (2ULL << 20)
> >>>>>> +#define AMDGPU_VA_RESERVED_SEQ64_START(top)
> >>>>>> (AMDGPU_VA_RESERVED_CSA_START(top) \
> >>>>>> +                         - AMDGPU_VA_RESERVED_SEQ64_SIZE)
> >>>>>> +#define AMDGPU_VA_RESERVED_TRAP_SIZE        (2ULL << 12) #define
> >>>>>> +AMDGPU_VA_RESERVED_TRAP_START(top)
> >>>>>> (AMDGPU_VA_RESERVED_SEQ64_START(top) \
> >>>>>> +                         - AMDGPU_VA_RESERVED_TRAP_SIZE)
> >>>>>>   #define AMDGPU_VA_RESERVED_BOTTOM        (2ULL << 20) -#define
> >>>>>> AMDGPU_VA_RESERVED_TOP (AMDGPU_VA_RESERVED_SEQ64_SIZE
> + \
> >>>>>> +#define AMDGPU_VA_RESERVED_TOP
> (AMDGPU_VA_RESERVED_TRAP_SIZE + \
> >>>>>> +                         AMDGPU_VA_RESERVED_SEQ64_SIZE + \
> >>>>>>                            AMDGPU_VA_RESERVED_CSA_SIZE)
> >>>>>>     /* See vm_update_mode */
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
> >>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
> >>>>>> index 6604a3f99c5e..f899cce25b2a 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
> >>>>>> @@ -36,6 +36,7 @@
> >>>>>>   #include <linux/mm.h>
> >>>>>>   #include <linux/mman.h>
> >>>>>>   #include <linux/processor.h>
> >>>>>> +#include "amdgpu_vm.h"
> >>>>>>     /*
> >>>>>>    * The primary memory I/O features being added for revisions of
> >>>>>> gfxip @@ -326,10 +327,16 @@ static void
> >>>>>> kfd_init_apertures_vi(struct kfd_process_device *pdd, uint8_t id)
> >>>>>>        * with small reserved space for kernel.
> >>>>>>        * Set them to CANONICAL addresses.
> >>>>>>        */
> >>>>>> -    pdd->gpuvm_base = SVM_USER_BASE;
> >>>>>> +    pdd->gpuvm_base = max(SVM_USER_BASE,
> >>>>>> AMDGPU_VA_RESERVED_BOTTOM);
> >>>>>>       pdd->gpuvm_limit =
> >>>>>> pdd->dev->kfd->shared_resources.gpuvm_size - 1;
> >>>>>>   +    /* dGPUs: the reserved space for kernel
> >>>>>> +     * before SVM
> >>>>>> +     */
> >>>>>> +    pdd->qpd.cwsr_base = SVM_CWSR_BASE;
> >>>>>> +    pdd->qpd.ib_base = SVM_IB_BASE;
> >>>>>> +
> >>>>>>       pdd->scratch_base = MAKE_SCRATCH_APP_BASE_VI();
> >>>>>>       pdd->scratch_limit =
> >>>>>> MAKE_SCRATCH_APP_LIMIT(pdd->scratch_base);
> >>>>>>   }
> >>>>>> @@ -339,18 +346,19 @@ static void kfd_init_apertures_v9(struct
> >>>>>> kfd_process_device *pdd, uint8_t id)
> >>>>>>       pdd->lds_base = MAKE_LDS_APP_BASE_V9();
> >>>>>>       pdd->lds_limit = MAKE_LDS_APP_LIMIT(pdd->lds_base);
> >>>>>>   -        /* Raven needs SVM to support graphic handle, etc.
> >>>>>> Leave the small
> >>>>>> -         * reserved space before SVM on Raven as well, even
> >>>>>> though we don't
> >>>>>> -         * have to.
> >>>>>> -         * Set gpuvm_base and gpuvm_limit to CANONICAL addresses
> >>>>>> so that they
> >>>>>> -         * are used in Thunk to reserve SVM.
> >>>>>> -         */
> >>>>>> -        pdd->gpuvm_base = SVM_USER_BASE;
> >>>>>> +    pdd->gpuvm_base = AMDGPU_VA_RESERVED_BOTTOM;
> >>>>>>       pdd->gpuvm_limit =
> >>>>>> pdd->dev->kfd->shared_resources.gpuvm_size - 1;
> >>>>>>         pdd->scratch_base = MAKE_SCRATCH_APP_BASE_V9();
> >>>>>>       pdd->scratch_limit =
> >>>>>> MAKE_SCRATCH_APP_LIMIT(pdd->scratch_base);
> >>>>>> +
> >>>>>> +    /*
> >>>>>> +     * Place TBA/TMA on opposite side of VM hole to prevent
> >>>>>> +     * stray faults from triggering SVM on these pages.
> >>>>>> +     */
> >>>>>> +    pdd->qpd.cwsr_base = AMDGPU_VA_RESERVED_TRAP_START(
> >>>>>> +        pdd->dev->adev->vm_manager.max_pfn <<
> >>>>>> AMDGPU_GPU_PAGE_SHIFT);
> >>>>>>   }
> >>>>>>     int kfd_init_apertures(struct kfd_process *process) @@
> >>>>>> -407,12 +415,6 @@ int kfd_init_apertures(struct kfd_process
> >>>>>> *process)
> >>>>>>                       return -EINVAL;
> >>>>>>                   }
> >>>>>>               }
> >>>>>> -
> >>>>>> -                        /* dGPUs: the reserved space for kernel
> >>>>>> -                         * before SVM
> >>>>>> -                         */
> >>>>>> -                        pdd->qpd.cwsr_base = SVM_CWSR_BASE;
> >>>>>> -                        pdd->qpd.ib_base = SVM_IB_BASE;
> >>>>>>           }
> >>>>>>             dev_dbg(kfd_device, "node id %u\n", id);
> >>>>>