[PATCH v3 1/2] drm/amdgpu: Auto-validate DMABuf imports in compute VMs

Christian König ckoenig.leichtzumerken at gmail.com
Wed Dec 20 13:58:56 UTC 2023


Am 19.12.23 um 23:43 schrieb Felix Kuehling:
> On 2023-12-19 3:10, Christian König wrote:
>> Am 15.12.23 um 16:19 schrieb Felix Kuehling:
>>>
>>> On 2023-12-15 07:30, Christian König wrote:
>>>>> @@ -1425,11 +1451,21 @@ int amdgpu_vm_handle_moved(struct 
>>>>> amdgpu_device *adev,
>>>>>           }
>>>>>             r = amdgpu_vm_bo_update(adev, bo_va, clear);
>>>>> -        if (r)
>>>>> -            return r;
>>>>>             if (unlock)
>>>>>               dma_resv_unlock(resv);
>>>>> +        if (r)
>>>>> +            return r;
>>>>> +
>>>>> +        /* Remember evicted DMABuf imports in compute VMs for later
>>>>> +         * validation
>>>>> +         */
>>>>> +        if (vm->is_compute_context && bo_va->base.bo &&
>>>>> +            bo_va->base.bo->tbo.base.import_attach &&
>>>>> +            (!bo_va->base.bo->tbo.resource ||
>>>>> + bo_va->base.bo->tbo.resource->mem_type == TTM_PL_SYSTEM))
>>>>> +            amdgpu_vm_bo_evicted(&bo_va->base);
>>>>> +
>>>>
>>>> The change looks mostly good now. Just one thing which worries me 
>>>> is that when GFX and compute is mixed in the same VM this here 
>>>> might cause problems when we run into an error during command 
>>>> submission.
>>>>
>>>> E.g. GFX validates the VM BOs, but then the IOCTL fails before 
>>>> calling amdgpu_vm_handle_moved().
>>>>
>>>> In this case the DMA-buf wouldn't be validated any more.
>>>
>>> This code path shouldn't be relevant for command submission, but for 
>>> the amdgpu_vm_handle_moved call in amdgpu_dma_buf_move_notify. 
>>> That's where the BO is first found to be evicted and its PTEs 
>>> invalidated. That's where we need to remember it so it can be 
>>> validated in the next call to amdgpu_vm_validate.
>>>
>>> Currently the amdgpu_cs code path calls amdgpu_vm_validate with 
>>> ticket=NULL, so it won't validate these imports. The only place that 
>>> auto-validates evicted imports is amdgpu_amdkfd_restore_process_bos. 
>>> So none of this should affect amdgpu_cs command submission.
>>
>> Yeah, but ticket=NULL will result in removing those imports from the 
>> validation list.
>
> I have a comment for that in amdgpu_vm_validate:
>
>                          if (!ticket) {
>                                  /* We need to move the BO out of the evicted
>                                   * list to avoid an infinite loop. It will be
>                                   * moved back to evicted in the next
>                                   * amdgpu_vm_handle_moved.
>                                   */
>                                  amdgpu_vm_bo_invalidated(bo_base);
>                                  spin_lock(&vm->status_lock);
>                                  continue;
>                          }
>
> The net result is that the BO is still tracked as evicted.
>

Yeah, that's exactly what I mean:

What happens if amdgpu_vm_validate() is called, removes the BOs from the 
evicted list, but then an error happens (or just an interrupted system 
call) and amdgpu_vm_handle_moved is never called?

In this case the DMA-bufs would be on the moved list and 
amdgpu_vm_handle moved would have to be called once before we can 
validate them again.

Regards,
Christian.

>
>> This could potentially result in not validating them should we ever 
>> mix GFX and Compute submissions in the same VM.
>
> My eviction test does exactly this, and it's working just fine.
>
> Regards,
>   Felix
>
>
>>
>> For now that works, but we need to clean that up more thoughtfully I 
>> think.
>>
>> Christian.
>>
>>>
>>>
>>> The flow for amdgpu_amdkfd_restore_process_bos is:
>>>
>>>  * amdgpu_vm_validate validates the evicted dmabuf imports (moves the
>>>    bo_vas from "evicted" to "invalidated")
>>>  * amdgpu_vm_handle_moved iterates over invalidated bo_vas and updates
>>>    the PTEs with valid entries (moves the bo_vas to "done")
>>>
>>>
>>> Regards,
>>>   Felix
>>>
>>>
>>>>
>>>> Regards,
>>>> Christian. 
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20231220/d66ef380/attachment.htm>


More information about the amd-gfx mailing list