[PATCH v3 1/2] drm/amdgpu: Auto-validate DMABuf imports in compute VMs

Felix Kuehling felix.kuehling at amd.com
Wed Dec 20 16:04:44 UTC 2023


On 2023-12-20 8:58, Christian König wrote:
> Am 19.12.23 um 23:43 schrieb Felix Kuehling:
>> On 2023-12-19 3:10, Christian König wrote:
>>> Am 15.12.23 um 16:19 schrieb Felix Kuehling:
>>>>
>>>> On 2023-12-15 07:30, Christian König wrote:
>>>>>> @@ -1425,11 +1451,21 @@ int amdgpu_vm_handle_moved(struct 
>>>>>> amdgpu_device *adev,
>>>>>>           }
>>>>>>             r = amdgpu_vm_bo_update(adev, bo_va, clear);
>>>>>> -        if (r)
>>>>>> -            return r;
>>>>>>             if (unlock)
>>>>>>               dma_resv_unlock(resv);
>>>>>> +        if (r)
>>>>>> +            return r;
>>>>>> +
>>>>>> +        /* Remember evicted DMABuf imports in compute VMs for later
>>>>>> +         * validation
>>>>>> +         */
>>>>>> +        if (vm->is_compute_context && bo_va->base.bo &&
>>>>>> + bo_va->base.bo->tbo.base.import_attach &&
>>>>>> +            (!bo_va->base.bo->tbo.resource ||
>>>>>> + bo_va->base.bo->tbo.resource->mem_type == TTM_PL_SYSTEM))
>>>>>> +            amdgpu_vm_bo_evicted(&bo_va->base);
>>>>>> +
>>>>>
>>>>> The change looks mostly good now. Just one thing which worries me 
>>>>> is that when GFX and compute is mixed in the same VM this here 
>>>>> might cause problems when we run into an error during command 
>>>>> submission.
>>>>>
>>>>> E.g. GFX validates the VM BOs, but then the IOCTL fails before 
>>>>> calling amdgpu_vm_handle_moved().
>>>>>
>>>>> In this case the DMA-buf wouldn't be validated any more.
>>>>
>>>> This code path shouldn't be relevant for command submission, but 
>>>> for the amdgpu_vm_handle_moved call in amdgpu_dma_buf_move_notify. 
>>>> That's where the BO is first found to be evicted and its PTEs 
>>>> invalidated. That's where we need to remember it so it can be 
>>>> validated in the next call to amdgpu_vm_validate.
>>>>
>>>> Currently the amdgpu_cs code path calls amdgpu_vm_validate with 
>>>> ticket=NULL, so it won't validate these imports. The only place 
>>>> that auto-validates evicted imports is 
>>>> amdgpu_amdkfd_restore_process_bos. So none of this should affect 
>>>> amdgpu_cs command submission.
>>>
>>> Yeah, but ticket=NULL will result in removing those imports from the 
>>> validation list.
>>
>> I have a comment for that in amdgpu_vm_validate:
>>
>>                          if (!ticket) {
>>                                  /* We need to move the BO out of the evicted
>>                                   * list to avoid an infinite loop. It will be
>>                                   * moved back to evicted in the next
>>                                   * amdgpu_vm_handle_moved.
>>                                   */
>>                                  amdgpu_vm_bo_invalidated(bo_base);
>>                                  spin_lock(&vm->status_lock);
>>                                  continue;
>>                          }
>>
>> The net result is that the BO is still tracked as evicted.
>>
>
> Yeah, that's exactly what I mean:
>
> What happens if amdgpu_vm_validate() is called, removes the BOs from 
> the evicted list, but then an error happens (or just an interrupted 
> system call) and amdgpu_vm_handle_moved is never called?
>
> In this case the DMA-bufs would be on the moved list and 
> amdgpu_vm_handle moved would have to be called once before we can 
> validate them again.

And that would be a good argument for tracking evicted user BOs with a 
separate state after all. That way we could just leave that list alone 
if ticket==NULL in amdgpu_vm_validate.

Regards,
   Felix


>
> Regards,
> Christian.
>
>>
>>> This could potentially result in not validating them should we ever 
>>> mix GFX and Compute submissions in the same VM.
>>
>> My eviction test does exactly this, and it's working just fine.
>>
>> Regards,
>>   Felix
>>
>>
>>>
>>> For now that works, but we need to clean that up more thoughtfully I 
>>> think.
>>>
>>> Christian.
>>>
>>>>
>>>>
>>>> The flow for amdgpu_amdkfd_restore_process_bos is:
>>>>
>>>>  * amdgpu_vm_validate validates the evicted dmabuf imports (moves the
>>>>    bo_vas from "evicted" to "invalidated")
>>>>  * amdgpu_vm_handle_moved iterates over invalidated bo_vas and updates
>>>>    the PTEs with valid entries (moves the bo_vas to "done")
>>>>
>>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Christian. 
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20231220/48b68b10/attachment.htm>


More information about the amd-gfx mailing list