[PATCH 2/6] drm/amdgpu: New VM state for evicted user BOs
Felix Kuehling
felix.kuehling at amd.com
Tue Nov 14 22:26:02 UTC 2023
On 2023-11-09 03:12, Christian König wrote:
> Am 08.11.23 um 22:23 schrieb Felix Kuehling:
>>
>> On 2023-11-08 07:28, Christian König wrote:
>>> Not necessary objections to this patch here, but rather how this new
>>> state is used later on.
>>>
>>> The fundamental problem is that re-validating things in
>>> amdgpu_vm_handle_moved() won't work in all cases.
>>>
>>> The general approach for both classic CS IOCTL as well as user
>>> queues is the following:
>>>
>>> 1. Grab all the necessary locks
>>> The VM lock + either everything inside the VM (user queues) or
>>> just the BOs in the submission (CS IOCTL).
>>>
>>> 2. Validate everything locked
>>> It can be that additional BO locks are grabbed for eviction and
>>> even locked BOs are shuffled around during that.
>>>
>>> 3. Update the PDEs and PTEs
>>> This can be done only after validating everything because things
>>> can still shuffle around during validation.
>>>
>>> 4. Do your CS or make the userqueu / process runable etc...
>>>
>>> 5. Attach fences to the locked BOs.
>>>
>>> 6. Unlock everything again.
>>>
>>> I think that is part of the problem why handling all updates in
>>> amdgpu_vm_handle_moved() for the CS doesn't work and sooner or later
>>> you might run into the same issue during process restore as well.
>>>
>>> If I'm not completely mistaken this should be solvable by moving all
>>> validation to amdgpu_vm_validate_pt_bos() instead (but let us rename
>>> that function).
>>
>> I'm trying to understand what the problem is. As far as I can tell,
>> amdgpu_vm_handle_moved just runs a bit later in the CS and
>> process-restore code than amdgpu_vm_validate_pt_bos. But it runs with
>> all the same reservations locked. My current implementation in
>> amdgpu_vm_handle_moved has some failure cases when reservations
>> cannot be acquired. I think your drm_exec patch series would make it
>> possible to handle this more robustly. That said, in case of KFD
>> process restore, we can retry in case of failures already.
>>
>> Anyway, moving my re-validation code to a renamed version of
>> amdgpu_vm_validate_pt_bos shouldn't be a big problem. I just don't
>> understand what problem that solves. Maybe it just makes the code
>> cleaner to handle both evicted states in one place? Or maybe you're
>> pointing to a way to merge the two states into one?
>
> Well yeah merging both state into one might be nice to have, but that
> isn't the fundamental problem.
>
> What basically happens during validation of a BO is that this BO is
> move to a desired place (as described by the placement parameter).
> During this it can happen that we shuffle out other BOs.
>
> Now the issue is that when you validate in amdgpu_vm_handle_moved() it
> can be that by validating BO we shuffle out other BOs which then end
> up of the evicted list again.
>
> This most likely won't happen with KFD, but for GFX CS it's certainly
> possible. So the idea is that we first validate everything and then
> update all the page tables and not inter mix the two operations.
I have something working now. But I think the fundamental problem is
what you say, that BOs get can evicted again before we update the page
tables. It won't be a problem for KFD, because I only care about DMABuf
imports, and they don't get evicted if I have the original allocations
reserved. I just have to make sure I validate/move the original
allocations before I validate their DMABuf imports. This could still be
a problem for future graphics interop.
The problem is, that I don't currently have a way to keep the validated
BOs reserved to prevent their eviction. That is on amd-staging-drm-next,
where I don't have your drm_exec patch series yet. Maybe this will get
easier once the branch gets rebased on Linux 6.5 or newer.
Regards,
Felix
>
> Regards,
> Christian.
>
>>
>> Regards,
>> Felix
>>
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 07.11.23 um 23:11 schrieb Felix Kuehling:
>>>> Hi Christian,
>>>>
>>>> I know you have objected to this patch before. I still think this
>>>> is the best solution for what I need. I can talk you through my
>>>> reasoning by email or offline. If I can't convince you, I will need
>>>> your guidance for a better solution.
>>>>
>>>> Thanks,
>>>> Felix
>>>>
>>>>
>>>> On 2023-11-07 11:58, Felix Kuehling wrote:
>>>>> Create a new VM state to track user BOs that are in the system
>>>>> domain.
>>>>> In the next patch this will be used do conditionally re-validate
>>>>> them in
>>>>> amdgpu_vm_handle_moved.
>>>>>
>>>>> Signed-off-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 +++++++++++++++++
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 5 ++++-
>>>>> 2 files changed, 21 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> index 1442d97ddd0f..0d685577243c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> @@ -232,6 +232,22 @@ static void amdgpu_vm_bo_invalidated(struct
>>>>> amdgpu_vm_bo_base *vm_bo)
>>>>> spin_unlock(&vm_bo->vm->status_lock);
>>>>> }
>>>>> +/**
>>>>> + * amdgpu_vm_bo_evicted_user - vm_bo is evicted
>>>>> + *
>>>>> + * @vm_bo: vm_bo which is evicted
>>>>> + *
>>>>> + * State for BOs used by user mode queues which are not at the
>>>>> location they
>>>>> + * should be.
>>>>> + */
>>>>> +static void amdgpu_vm_bo_evicted_user(struct amdgpu_vm_bo_base
>>>>> *vm_bo)
>>>>> +{
>>>>> + vm_bo->moved = true;
>>>>> + spin_lock(&vm_bo->vm->status_lock);
>>>>> + list_move(&vm_bo->vm_status, &vm_bo->vm->evicted_user);
>>>>> + spin_unlock(&vm_bo->vm->status_lock);
>>>>> +}
>>>>> +
>>>>> /**
>>>>> * amdgpu_vm_bo_relocated - vm_bo is reloacted
>>>>> *
>>>>> @@ -2110,6 +2126,7 @@ int amdgpu_vm_init(struct amdgpu_device
>>>>> *adev, struct amdgpu_vm *vm, int32_t xcp
>>>>> for (i = 0; i < AMDGPU_MAX_VMHUBS; i++)
>>>>> vm->reserved_vmid[i] = NULL;
>>>>> INIT_LIST_HEAD(&vm->evicted);
>>>>> + INIT_LIST_HEAD(&vm->evicted_user);
>>>>> INIT_LIST_HEAD(&vm->relocated);
>>>>> INIT_LIST_HEAD(&vm->moved);
>>>>> INIT_LIST_HEAD(&vm->idle);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>> index 12cac06fa289..939d0c2219c0 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>> @@ -286,9 +286,12 @@ struct amdgpu_vm {
>>>>> /* Lock to protect vm_bo add/del/move on all lists of vm */
>>>>> spinlock_t status_lock;
>>>>> - /* BOs who needs a validation */
>>>>> + /* Per VM and PT BOs who needs a validation */
>>>>> struct list_head evicted;
>>>>> + /* BOs for user mode queues that need a validation */
>>>>> + struct list_head evicted_user;
>>>>> +
>>>>> /* PT BOs which relocated and their parent need an update */
>>>>> struct list_head relocated;
>>>
>
More information about the amd-gfx
mailing list