[PATCH v3 1/2] drm/amdgpu: Auto-validate DMABuf imports in compute VMs

Felix Kuehling felix.kuehling at amd.com
Tue Dec 19 22:43:20 UTC 2023


On 2023-12-19 3:10, Christian König wrote:
> Am 15.12.23 um 16:19 schrieb Felix Kuehling:
>>
>> On 2023-12-15 07:30, Christian König wrote:
>>>> @@ -1425,11 +1451,21 @@ int amdgpu_vm_handle_moved(struct 
>>>> amdgpu_device *adev,
>>>>           }
>>>>             r = amdgpu_vm_bo_update(adev, bo_va, clear);
>>>> -        if (r)
>>>> -            return r;
>>>>             if (unlock)
>>>>               dma_resv_unlock(resv);
>>>> +        if (r)
>>>> +            return r;
>>>> +
>>>> +        /* Remember evicted DMABuf imports in compute VMs for later
>>>> +         * validation
>>>> +         */
>>>> +        if (vm->is_compute_context && bo_va->base.bo &&
>>>> +            bo_va->base.bo->tbo.base.import_attach &&
>>>> +            (!bo_va->base.bo->tbo.resource ||
>>>> + bo_va->base.bo->tbo.resource->mem_type == TTM_PL_SYSTEM))
>>>> +            amdgpu_vm_bo_evicted(&bo_va->base);
>>>> +
>>>
>>> The change looks mostly good now. Just one thing which worries me is 
>>> that when GFX and compute is mixed in the same VM this here might 
>>> cause problems when we run into an error during command submission.
>>>
>>> E.g. GFX validates the VM BOs, but then the IOCTL fails before 
>>> calling amdgpu_vm_handle_moved().
>>>
>>> In this case the DMA-buf wouldn't be validated any more.
>>
>> This code path shouldn't be relevant for command submission, but for 
>> the amdgpu_vm_handle_moved call in amdgpu_dma_buf_move_notify. That's 
>> where the BO is first found to be evicted and its PTEs invalidated. 
>> That's where we need to remember it so it can be validated in the 
>> next call to amdgpu_vm_validate.
>>
>> Currently the amdgpu_cs code path calls amdgpu_vm_validate with 
>> ticket=NULL, so it won't validate these imports. The only place that 
>> auto-validates evicted imports is amdgpu_amdkfd_restore_process_bos. 
>> So none of this should affect amdgpu_cs command submission.
>
> Yeah, but ticket=NULL will result in removing those imports from the 
> validation list.

I have a comment for that in amdgpu_vm_validate:

                         if (!ticket) {
                                 /* We need to move the BO out of the evicted
                                  * list to avoid an infinite loop. It will be
                                  * moved back to evicted in the next
                                  * amdgpu_vm_handle_moved.
                                  */
                                 amdgpu_vm_bo_invalidated(bo_base);
                                 spin_lock(&vm->status_lock);
                                 continue;
                         }

The net result is that the BO is still tracked as evicted.


> This could potentially result in not validating them should we ever 
> mix GFX and Compute submissions in the same VM.

My eviction test does exactly this, and it's working just fine.

Regards,
   Felix


>
> For now that works, but we need to clean that up more thoughtfully I 
> think.
>
> Christian.
>
>>
>>
>> The flow for amdgpu_amdkfd_restore_process_bos is:
>>
>>  * amdgpu_vm_validate validates the evicted dmabuf imports (moves the
>>    bo_vas from "evicted" to "invalidated")
>>  * amdgpu_vm_handle_moved iterates over invalidated bo_vas and updates
>>    the PTEs with valid entries (moves the bo_vas to "done")
>>
>>
>> Regards,
>>   Felix
>>
>>
>>>
>>> Regards,
>>> Christian. 
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20231219/bb750e7c/attachment.htm>


More information about the amd-gfx mailing list