[PATCH] Revert "drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole"
Felix Kuehling
felix.kuehling at amd.com
Wed Jan 3 18:58:22 UTC 2024
On 2024-01-03 10:32, Jay Cornwall wrote:
> On 1/3/2024 09:19, Alex Deucher wrote:
>> + Jay, Felix
>>
>> On Wed, Jan 3, 2024 at 5:16 AM Kaibo Ma <ent3rm4n at gmail.com> wrote:
>>> That commit causes NULL pointer dereferences in dmesgs when
>>> running applications using ROCm, including clinfo, blender,
>>> and PyTorch, since v6.6.1. Revert it to fix blender again.
>>>
>>> This reverts commit 96c211f1f9ef82183493f4ceed4e347b52849149.
>>>
>>> Closes: https://github.com/ROCm/ROCm/issues/2596
>>> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2991
>>> Signed-off-by: Kaibo Ma <ent3rm4n at gmail.com>
>>> ---
>>> drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c | 26 ++++++++++----------
>>> 1 file changed, 13 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c b/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
>>> index 62b205dac..6604a3f99 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
>>> @@ -330,12 +330,6 @@ static void kfd_init_apertures_vi(struct kfd_process_device *pdd, uint8_t id)
>>> pdd->gpuvm_limit =
>>> pdd->dev->kfd->shared_resources.gpuvm_size - 1;
>>>
>>> - /* dGPUs: the reserved space for kernel
>>> - * before SVM
>>> - */
>>> - pdd->qpd.cwsr_base = SVM_CWSR_BASE;
>>> - pdd->qpd.ib_base = SVM_IB_BASE;
>>> -
>>> pdd->scratch_base = MAKE_SCRATCH_APP_BASE_VI();
>>> pdd->scratch_limit = MAKE_SCRATCH_APP_LIMIT(pdd->scratch_base);
>>> }
>>> @@ -345,18 +339,18 @@ static void kfd_init_apertures_v9(struct kfd_process_device *pdd, uint8_t id)
>>> pdd->lds_base = MAKE_LDS_APP_BASE_V9();
>>> pdd->lds_limit = MAKE_LDS_APP_LIMIT(pdd->lds_base);
>>>
>>> - pdd->gpuvm_base = PAGE_SIZE;
>>> + /* Raven needs SVM to support graphic handle, etc. Leave the small
>>> + * reserved space before SVM on Raven as well, even though we don't
>>> + * have to.
>>> + * Set gpuvm_base and gpuvm_limit to CANONICAL addresses so that they
>>> + * are used in Thunk to reserve SVM.
>>> + */
>>> + pdd->gpuvm_base = SVM_USER_BASE;
>>> pdd->gpuvm_limit =
>>> pdd->dev->kfd->shared_resources.gpuvm_size - 1;
>>>
>>> pdd->scratch_base = MAKE_SCRATCH_APP_BASE_V9();
>>> pdd->scratch_limit = MAKE_SCRATCH_APP_LIMIT(pdd->scratch_base);
>>> -
>>> - /*
>>> - * Place TBA/TMA on opposite side of VM hole to prevent
>>> - * stray faults from triggering SVM on these pages.
>>> - */
>>> - pdd->qpd.cwsr_base = pdd->dev->kfd->shared_resources.gpuvm_size;
>>> }
>>>
>>> int kfd_init_apertures(struct kfd_process *process)
>>> @@ -413,6 +407,12 @@ int kfd_init_apertures(struct kfd_process *process)
>>> return -EINVAL;
>>> }
>>> }
>>> +
>>> + /* dGPUs: the reserved space for kernel
>>> + * before SVM
>>> + */
>>> + pdd->qpd.cwsr_base = SVM_CWSR_BASE;
>>> + pdd->qpd.ib_base = SVM_IB_BASE;
>>> }
>>>
>>> dev_dbg(kfd_device, "node id %u\n", id);
>>> --
>>> 2.42.0
>>>
> I saw a segfault issue in Mesa yesterday. Not sure about the others, but I don't know how to make this change while maintaining compatibility with older UMDs.
A segfault in Mesa seems to be a different issue from what's mentioned
in the commit message. I'd let Christian or Marek comment on
compatibility with graphics UMDs. I'm not sure why this patch would
affect them at all.
Looking at the logs in the tickets, it looks like a fence reference
counting error. I don't see how Jay's patch could have caused that. I
made another change in that code recently that could make a difference
for this issue:
commit 8f08c5b24ced1be7eb49692e4816c1916233c79b
Author: Felix Kuehling <Felix.Kuehling at amd.com>
Date:   Fri Oct 27 18:21:55 2023 -0400

    drm/amdkfd: Run restore_workers on freezable WQs

    Make restore workers freezable so we don't have to explicitly flush
    them in suspend and GPU reset code paths, and we don't accidentally
    try to restore BOs while the GPU is suspended. Not having to flush
    restore_work also helps avoid lock/fence dependencies in the GPU
    reset case where we're not allowed to wait for fences.

    A side effect of this is, that we can now have multiple concurrent
    threads trying to signal the same eviction fence. Rework eviction
    fence signaling and replacement to account for that.

    The GPU reset path can no longer rely on restore_process_worker to
    resume queues because evict/restore workers can run independently of
    it. Instead call a new restore_process_helper directly.

    This is an RFC and request for testing.

    v2:
    - Reworked eviction fence signaling
    - Introduced restore_process_helper

    v3:
    - Handle unsignaled eviction fences in restore_process_bos

    Signed-off-by: Felix Kuehling <Felix.Kuehling at amd.com>
    Acked-by: Christian König <christian.koenig at amd.com>
    Tested-by: Emily Deng <Emily.Deng at amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
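
The freezable-workqueue approach that commit describes can be sketched roughly as follows (hypothetical helper and variable names; the actual amdkfd code differs in detail): restore work is queued on a dedicated workqueue created with WQ_FREEZABLE, so the freezer quiesces it across suspend instead of the suspend/reset paths having to flush it explicitly.

```c
/* Sketch only: illustrates WQ_FREEZABLE usage, not the exact kfd code. */
#include <linux/workqueue.h>

static struct workqueue_struct *kfd_restore_wq;

static int kfd_restore_wq_init(void)
{
	/* WQ_FREEZABLE: the freezer drains this queue during suspend,
	 * so queued restore work cannot run while the GPU is suspended
	 * and no explicit flush is needed in suspend/reset paths. */
	kfd_restore_wq = alloc_workqueue("kfd_restore_wq",
					 WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);
	return kfd_restore_wq ? 0 : -ENOMEM;
}

static void kfd_schedule_restore(struct delayed_work *restore_work,
				 unsigned long delay)
{
	/* Queue on the freezable WQ instead of the system workqueue. */
	queue_delayed_work(kfd_restore_wq, restore_work, delay);
}
```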
FWIW, I built a plain 6.6 kernel, and was not able to reproduce the
crash with some simple tests.
Regards,
Felix
>
> So I agree, let's revert it.
>
> Reviewed-by: Jay Cornwall <jay.cornwall at amd.com>