[PATCH] drm/amdkfd: make sure VM is ready for updating operations

Felix Kuehling felix.kuehling at amd.com
Wed Apr 10 00:32:26 UTC 2024


On 2024-04-08 3:55, Christian König wrote:
> On 07.04.24 at 06:52, Lang Yu wrote:
>> When VM is in evicting state, amdgpu_vm_update_range would return 
>> -EBUSY.
>> Then restore_process_worker runs into a dead loop.
>>
>> Fixes: 2fdba514ad5a ("drm/amdgpu: Auto-validate DMABuf imports in 
>> compute VMs")
>
> Hmm, while it would be good to have this case handled as an error, it 
> should never occur in practice, since we should have validated the VM 
> before validating the DMA-bufs.
>
> @Felix isn't that something we have taken care of?

The problem I saw when I implemented auto-validation was that migrating 
a BO invalidates its DMABuf attachments. So I need to validate the 
DMABuf attachments after validating the BOs they attach to. This 
auto-validation happens in amdgpu_vm_validate, so I needed to do the VM 
validation after the BO validation. The problem now seems to be that the 
BO validation happens in the same loop as the page table update, and the 
page table update fails if the VM is not valid.

I never saw this problem in my testing, probably because I never got my 
page tables evicted?

Anyway, I think the solution is to split the BO validation and page 
table update into two separate loops in 
amdgpu_amdkfd_gpuvm_restore_process_bos:

 1. Validate BOs
 2. Validate VM (and DMABuf attachments)
 3. Update page tables for the BOs validated above
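
The ordering above can be illustrated with a small user-space model. This is 
only a sketch, not kernel code: every type and helper below (toy_bo, toy_vm, 
toy_update_range and so on) is a hypothetical stand-in invented for the 
example, and the single-loop variant is an assumption about the current 
structure based on the description in this mail.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these exist in the amdgpu driver. */
struct toy_bo {
	bool valid;   /* backing memory has been validated */
	bool mapped;  /* page table entries are up to date */
};

struct toy_vm {
	bool evicting; /* page tables evicted, VM not ready for updates */
};

/* Models a page table update that refuses to run on an unready VM. */
static int toy_update_range(struct toy_vm *vm, struct toy_bo *bo)
{
	if (vm->evicting)
		return -EBUSY;
	if (!bo->valid)
		return -EINVAL;
	bo->mapped = true;
	return 0;
}

static int toy_validate_bo(struct toy_bo *bo)
{
	bo->valid = true;
	return 0;
}

/* In the real driver this step would also revalidate DMABuf attachments. */
static int toy_validate_vm(struct toy_vm *vm)
{
	vm->evicting = false;
	return 0;
}

/* Assumed current shape: validate and map each BO in the same loop. */
static int toy_restore_single_loop(struct toy_vm *vm, struct toy_bo *bos,
				   size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int ret = toy_validate_bo(&bos[i]);
		if (ret)
			return ret;
		ret = toy_update_range(vm, &bos[i]); /* -EBUSY if evicting */
		if (ret)
			return ret;
	}
	return toy_validate_vm(vm);
}

/* Proposed shape: three separate phases. */
static int toy_restore_split(struct toy_vm *vm, struct toy_bo *bos, size_t n)
{
	int ret;

	/* 1. Validate BOs */
	for (size_t i = 0; i < n; i++) {
		ret = toy_validate_bo(&bos[i]);
		if (ret)
			return ret;
	}

	/* 2. Validate VM (and DMABuf attachments) */
	ret = toy_validate_vm(vm);
	if (ret)
		return ret;

	/* 3. Update page tables for the BOs validated above */
	for (size_t i = 0; i < n; i++) {
		ret = toy_update_range(vm, &bos[i]);
		if (ret)
			return ret;
	}
	return 0;
}
```

In this model, starting from an evicting VM, the single-loop variant fails 
with -EBUSY on the first page table update and never reaches the VM 
validation, so a caller that simply retries would see the same failure again. 
The split variant validates the VM before touching any page tables.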

Regards,
   Felix


>
> Regards,
> Christian.
>
>
>>
>> Signed-off-by: Lang Yu <Lang.Yu at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 6 ++++++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 0ae9fd844623..8c71fe07807a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -2900,6 +2900,12 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence __rcu *
>>
>>       amdgpu_sync_create(&sync_obj);
>>
>> +    ret = process_validate_vms(process_info, NULL);
>> +    if (ret) {
>> +        pr_debug("Validating VMs failed, ret: %d\n", ret);
>> +        goto validate_map_fail;
>> +    }
>> +
>>       /* Validate BOs and map them to GPUVM (update VM page tables). */
>>       list_for_each_entry(mem, &process_info->kfd_bo_list,
>>                   validate_list) {
>