[Nouveau] [PATCH drm-misc-next v3 6/7] drm/gpuvm: generalize dma_resv/extobj handling and GEM validation
Thomas Hellström
thomas.hellstrom at linux.intel.com
Wed Sep 20 08:29:48 UTC 2023
On 9/20/23 09:44, Thomas Hellström wrote:
> Hi,
>
> On 9/20/23 07:37, Christian König wrote:
>> Am 19.09.23 um 17:23 schrieb Thomas Hellström:
>>>
>>> On 9/19/23 17:16, Danilo Krummrich wrote:
>>>> On 9/19/23 14:21, Thomas Hellström wrote:
>>>>> Hi Christian
>>>>>
>>>>> On 9/19/23 14:07, Christian König wrote:
>>>>>> Am 13.09.23 um 17:46 schrieb Danilo Krummrich:
>>>>>>> On 9/13/23 17:33, Christian König wrote:
>>>>>>>> Am 13.09.23 um 17:15 schrieb Danilo Krummrich:
>>>>>>>>> On 9/13/23 16:26, Christian König wrote:
>>>>>>>>>> Am 13.09.23 um 14:16 schrieb Danilo Krummrich:
>>>>>>>>>>> As mentioned in a different mail thread, the reply is based
>>>>>>>>>>> on the assumption
>>>>>>>>>>> that we don't support anything else than GPUVM updates from
>>>>>>>>>>> the IOCTL.
>>>>>>>>>>
>>>>>>>>>> I think that this assumption is incorrect.
>>>>>>>>>
>>>>>>>>> Well, more precisely I should have said "don't support GPUVM
>>>>>>>>> updated from within
>>>>>>>>> fence signaling critical sections". And looking at the code,
>>>>>>>>> that doesn't seem what
>>>>>>>>> you're doing there.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Vulkan is just once specific use case, but this here should
>>>>>>>>>> probably be able to handle other use cases as well.
>>>>>>>>>>
>>>>>>>>>> Especially with HMM you get the requirement that you need to
>>>>>>>>>> be able to invalidate GPUVM mappings without grabbing a
>>>>>>>>>> reservation lock.
>>>>>>>>>
>>>>>>>>> What do you mean with "invalidate GPUVM mappings" in this
>>>>>>>>> context? drm_gpuvm_bo_evict()
>>>>>>>>> should only be called from a ttm_device_funcs::move callback,
>>>>>>>>> we should hold the dma-resv
>>>>>>>>> lock there.
>>>>>>>>
>>>>>>>> Well the question is which dma-resv lock do we hold?
>>>>>>>>
>>>>>>>> In the move callback we only hold the dma-resv lock of the BO
>>>>>>>> which is moved, but when that is a shared BO then that's not
>>>>>>>> the same as the one for the VM.
>>>>>>>
>>>>>>> Correct, Thomas' idea was to use the GEM's dma_resv lock to
>>>>>>> protect drm_gpuvm_bo::evicted
>>>>>>> and then actually move the drm_gpuvm_bo to the VM's evicted list
>>>>>>> once we grabbed all
>>>>>>> dma-resv locks when locking the VM's BOs using drm_exec. We can
>>>>>>> remove them from the evicted
>>>>>>> list on validate(). This way we never touch the evicted list
>>>>>>> without holding at least the VM's
>>>>>>> dma-resv lock.
>>>>>>>
>>>>>>> Do you have any concerns about that?
>>>>>>
>>>>>> Scratching my head a bit how that is supposed to work.
>>>>>>
>>>>>> This implies that you go over all the evicted BOs during
>>>>>> validation and not just the one mentioned in the CS.
>>>>>>
>>>>>> That might work for Vulkan, but is pretty much a no-go for OpenGL.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> See what the eviction lock in amdgpu is doing for example.
>>>>>>>>>
>>>>>>>>> The eviction_lock seems to protect a VM state "evicting" of
>>>>>>>>> whether any BO that
>>>>>>>>> is associated with the VM is currently evicting. At the same
>>>>>>>>> time amdgpu protects
>>>>>>>>> the eviceted list of the VM with a different lock. So this
>>>>>>>>> seems to be entirely
>>>>>>>>> unrelated. Tracking a "currently evicting" state is not part
>>>>>>>>> of the GPUVM
>>>>>>>>> implementation currently and hence nothing would change for
>>>>>>>>> amdgpu there.
>>>>>>>>
>>>>>>>> Sorry for the confusion we use different terminology in amdgpu.
>>>>>>>>
>>>>>>>> The eviction lock and evicted state is for the VM page tables,
>>>>>>>> e.g. if the whole VM is currently not used and swapped out or
>>>>>>>> even de-allocated.
>>>>>>>>
>>>>>>>> This is necessary because we have cases where we need to access
>>>>>>>> the VM data without holding the dma-resv lock of this VM.
>>>>>>>> Especially figuring out which parts of an address space contain
>>>>>>>> mappings and which doesn't.
>>>>>>>
>>>>>>> I think this is fine, this has nothing to do with lists of
>>>>>>> evicted GEM objects or external GEM
>>>>>>> objects, right? Marking mappings (drm_gpuva) as invalidated
>>>>>>> (DRM_GPUVA_INVALIDATED) or accessing
>>>>>>> the VA space does not require any dma-resv locks.
>>>>>>
>>>>>> I hope so, but I'm not 100% sure.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> This is a requirement which comes with HMM handling, you won't
>>>>>>>> see this with Vulkan (or OpenGL, VAAPI etc..).
>>>>>>>>
>>>>>>>>
>>>>>>>> The invalidation lock on the other hand is what in this
>>>>>>>> discussion is called eviction lock. This one is needed because
>>>>>>>> what I wrote above, during the move callback only the dma-resv
>>>>>>>> of the BO which is moved is locked, but not necessarily the
>>>>>>>> dma-resv of the VM.
>>>>>>>
>>>>>>> That's yet another thing, right? This is used to track whether
>>>>>>> *any* BO that belongs to the VM is
>>>>>>> currently being evicted, correct? As mentioned, as by now this
>>>>>>> is not supported in GPUVM and hence
>>>>>>> would be the same driver specific code with the same driver
>>>>>>> specifc lock.
>>>>>>
>>>>>> That is most likely a show stopper using this for OpenGL based
>>>>>> workloads as far as I can see. For those you need to able to
>>>>>> figure out which non-VM BOs have been evicted and which parts of
>>>>>> the VM needs updates.
>>>>>
>>>>> We identify those with a bool in the gpuvm_bo, and that bool is
>>>>> protected by the bo_resv. In essence, the "evicted" list must be
>>>>> made up-to-date with all relevant locks held before traversing in
>>>>> the next exec.
>>>>
>>>> What I still miss with this idea is how do we find all the
>>>> drm_gpuvm_bo structures with the evicted bool set to true? When
>>>> doing the drm_exec dance we come across all external ones and can
>>>> add them to the list if needed, but what about the BOs having the
>>>> VM's dma-resv?
>>>
>>> Oh, they can be added to the evict list directly (no bool needed) in
>>> the eviction code, like in v3. Since for those we indeed hold the
>>> VM's dma_resv since it's aliased with the object's dma-resv.
>>
>> Yeah, I wanted to note what Danilo seems to think about as well. How
>> do we figure out the non-VM BOs evicted?
>>
>> We can't walk over the list of all non-VM BOs on every submission,
>> that's to much overhead for cases with lots of non-VM BOs.
>>
>> And we can't rely on userspace sending all non-VM BOs as used list
>> down to the kernel with each submission.
>>
>> Regards,
>> Christian.
>
> No, that's not needed: Mechanism below.
>
> 1) We maintain an evicted list. Typically protected by the vm resv.
> 2) Each gpuvm_bo has a bool "evicted". Protected by the bo resv.
>
> a) Evicting a vm bo: The vm resv is held by the eviction code. Just
> put it on the evicted list.
> b) Evicting a shared/external bo: The bo resv is held by the eviction
> code. Set the "evicted" bool
> c) Validating the evicted list on exec: Loop through all
> *external/shared* bos. Lock them. After locking, check the "evicted"
> bool, if it's true. put the bo on the evicted list (we hold the VM
> resv at this point) and clear the "evicted" bool. Note that other vms
> will have their own gpuvm_bo which is marked evicted.
>
> I have this coded up in a patch for Xe and it seems to be working
> properly.
>
> /Thomas
>
Something along the lines of the attach patch.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-drm-gpuvm-Adjustment-for-extobj-eviction.patch
Type: text/x-patch
Size: 3065 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/nouveau/attachments/20230920/4d018ac1/attachment.bin>
More information about the Nouveau
mailing list