[Nouveau] [PATCH drm-misc-next 2/3] drm/gpuva_mgr: generalize dma_resv/extobj handling and GEM validation

Thomas Hellström thomas.hellstrom at linux.intel.com
Thu Aug 31 17:23:35 UTC 2023


On 8/31/23 18:53, Thomas Hellström (Intel) wrote:
> Hi,
>
> On 8/31/23 13:18, Danilo Krummrich wrote:
>> On Thu, Aug 31, 2023 at 11:04:06AM +0200, Thomas Hellström (Intel) 
>> wrote:
>>> Hi!
>>>
>>> On 8/30/23 17:00, Danilo Krummrich wrote:
>>>> On Wed, Aug 30, 2023 at 03:42:08PM +0200, Thomas Hellström (Intel) 
>>>> wrote:
>>>>> On 8/30/23 14:49, Danilo Krummrich wrote:
>>>>>> Hi Thomas,
>>>>>>
>>>>>> thanks for having a look!
>>>>>>
>>>>>> On Wed, Aug 30, 2023 at 09:27:45AM +0200, Thomas Hellström 
>>>>>> (Intel) wrote:
>>>>>>> Hi, Danilo.
>>>>>>>
>>>>>>> Some quick comments since I'm doing some Xe work in this area. 
>>>>>>> Will probably
>>>>>>> get back with more.
>>>>>>>
>>>>>>> On 8/20/23 23:53, Danilo Krummrich wrote:
>> <snip>
>>
>>>>>>>> diff --git a/include/drm/drm_gpuva_mgr.h 
>>>>>>>> b/include/drm/drm_gpuva_mgr.h
>>>>>>>> index ed8d50200cc3..693e2da3f425 100644
>>>>>>>> --- a/include/drm/drm_gpuva_mgr.h
>>>>>>>> +++ b/include/drm/drm_gpuva_mgr.h
>>>>>>>> @@ -26,12 +26,16 @@
>>>>>>>>       */
>>>>>>>>      #include <linux/list.h>
>>>>>>>> +#include <linux/dma-resv.h>
>>>>>>>> +#include <linux/maple_tree.h>
>>>>>>>>      #include <linux/rbtree.h>
>>>>>>>>      #include <linux/types.h>
>>>>>>>>      #include <drm/drm_gem.h>
>>>>>>>> +#include <drm/drm_exec.h>
>>>>>>>>      struct drm_gpuva_manager;
>>>>>>>> +struct drm_gpuva_gem;
>>>>>>>>      struct drm_gpuva_fn_ops;
>>>>>>>>      /**
>>>>>>>> @@ -140,7 +144,7 @@ struct drm_gpuva {
>>>>>>>>      int drm_gpuva_insert(struct drm_gpuva_manager *mgr, struct 
>>>>>>>> drm_gpuva *va);
>>>>>>>>      void drm_gpuva_remove(struct drm_gpuva *va);
>>>>>>>> -void drm_gpuva_link(struct drm_gpuva *va);
>>>>>>>> +void drm_gpuva_link(struct drm_gpuva *va, struct drm_gpuva_gem 
>>>>>>>> *vm_bo);
>>>>>>>>      void drm_gpuva_unlink(struct drm_gpuva *va);
>>>>>>>>      struct drm_gpuva *drm_gpuva_find(struct drm_gpuva_manager 
>>>>>>>> *mgr,
>>>>>>>> @@ -240,15 +244,137 @@ struct drm_gpuva_manager {
>>>>>>>>           * @ops: &drm_gpuva_fn_ops providing the split/merge 
>>>>>>>> steps to drivers
>>>>>>>>           */
>>>>>>>>          const struct drm_gpuva_fn_ops *ops;
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @d_obj: Dummy GEM object; used internally to pass the 
>>>>>>>> GPU VMs
>>>>>>>> +     * dma-resv to &drm_exec.
>>>>>>>> +     */
>>>>>>>> +    struct drm_gem_object d_obj;
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @resv: the &dma_resv for &drm_gem_objects mapped in 
>>>>>>>> this GPU VA
>>>>>>>> +     * space
>>>>>>>> +     */
>>>>>>>> +    struct dma_resv *resv;
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @exec: the &drm_exec helper to lock external 
>>>>>>>> &drm_gem_objects
>>>>>>>> +     */
>>>>>>>> +    struct drm_exec exec;
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @mt_ext: &maple_tree storing external &drm_gem_objects
>>>>>>>> +     */
>>>>>>>> +    struct maple_tree mt_ext;
>>>>>>> Why are you using a maple tree here? Insertion and removal is 
>>>>>>> O(log(n))
>>>>>>> instead of O(1) for a list?
>>>>>>>
>>>>>> Having a list of drm_gem_objects directly wouldn't work, as 
>>>>>> multiple GPU-VMs
>>>>>> could have mappings of the same extobj.
>>>>>>
>>>>>> I considered using the VM_BO abstraction (struct drm_gpuva_gem) 
>>>>>> as list entry
>>>>>> instead, which also seems to be the obvious choice. However, 
>>>>>> there is a locking
>>>>>> conflict.
>>>>>>
>>>>>> A drm_gem_object keeps a list of drm_gpuva_gems, while each 
>>>>>> drm_gpuva_gem keeps
>>>>>> a list of drm_gpuvas. Both lists are either protected with the 
>>>>>> dma-resv lock of
>>>>>> the corresponding drm_gem_object, or with an external lock 
>>>>>> provided by the
>>>>>> driver (see drm_gem_gpuva_set_lock()). The latter is used by 
>>>>>> drivers performing
>>>>>> changes on the GPUVA space directly from the fence signalling path.
>>>>>>
>>>>>> Now, similar to what drm_gpuva_link() and drm_gpuva_unlink() are 
>>>>>> doing already,
>>>>>> we'd want to add a drm_gpuva_gem to the extobj list for the first 
>>>>>> mapping being
>>>>>> linked and we'd want to remove it for the last one being unlinked.
>>>>>>
>>>>>> (Actually we'd want to add the drm_gpuva_gem object to the extobj 
>>>>>> list even
>>>>>> before, because otherwise we'd not acquire this GEM object's dma-resv
>>>>>> lock
>>>>>> through drm_gpuva_manager_lock(). But that's trivial, we could do 
>>>>>> that when we
>>>>>> create the drm_gpuva_gem, which we need to do anyway.)
>>>>>>
>>>>>> Anyway, we'd probably want to keep removing the drm_gpuva_gem 
>>>>>> from the extobj
>>>>>> list from drm_gpuva_unlink() when the last mapping of this BO is 
>>>>>> unlinked. In
>>>>>> order to do so, we'd (as discussed above) either need to hold the 
>>>>>> outer GPU-VM
>>>>>> lock or the GPU-VM's dma-resv lock. Both would be illegal in case
>>>>>> drm_gpuva_unlink() is called from within the fence signalling 
>>>>>> path. For drivers
>>>>>> like XE or Nouveau, we'd at least need to make sure to not mess 
>>>>>> up the locking
>>>>>> hierarchy of GPU-VM lock and dma-resv lock of the corresponding BO.
>>>>>>
>>>>>> Considering all that, I thought it's probably better to track 
>>>>>> extobjs separate
>>>>>> from the drm_gpuva_gem, hence the maple tree choice.
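
For illustration only, a rough sketch in kernel-style C of the relationships and
locking described above, and of why removal keyed on the GEM object's address
avoids the locking problem. The struct and field names (sketch_vm_bo, list_gem,
list_va) are hypothetical; mt_ext, drm_gem_gpuva_set_lock() and
drm_gpuva_manager_lock() are taken from the series under review, so this is a
sketch of the idea rather than the patch's actual code:

#include <linux/list.h>
#include <linux/maple_tree.h>
#include <drm/drm_gem.h>
#include <drm/drm_gpuva_mgr.h>

/* One VM_BO ("drm_gpuva_gem") per (GEM object, GPU-VM) pair. */
struct sketch_vm_bo {
	struct drm_gem_object *obj;	/* the backing GEM object */
	struct drm_gpuva_manager *mgr;	/* the GPU-VM it belongs to */
	struct list_head list_gem;	/* entry in obj's VM_BO list; protected by
					 * obj's dma-resv or the driver lock set
					 * via drm_gem_gpuva_set_lock() */
	struct list_head list_va;	/* drm_gpuvas mapping obj in this VM;
					 * protected by the same lock */
};

/*
 * Dropping the extobj entry when the last mapping is unlinked: erasing from
 * the per-VM maple tree, keyed by the GEM object's address, needs neither the
 * outer GPU-VM lock nor the VM's dma-resv (mtree_erase() takes the tree's
 * internal spinlock by default), so it stays usable from the fence
 * signalling path.
 */
static void sketch_extobj_remove(struct drm_gpuva_manager *mgr,
				 struct drm_gem_object *obj)
{
	mtree_erase(&mgr->mt_ext, (unsigned long)obj);
}
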
>>>>> Hm. OK, in Xe we have a list of the xe_vmas (drm_gpuvas) that 
>>>>> point to
>>>>> external objects, or in the case of multiple mappings to the same gem
>>>>> object, only one of the drm_gpuvas is in the list. These are 
>>>>> protected by
>>>>> the GPU-VM lock. I don't see a problem with removing those from 
>>>>> the fence
>>>>> signalling path, though?
>>>> I intentionally tried to avoid keeping a list of drm_gpuvas to 
>>>> track extobjs,
>>>> since this is generic code I don't know how many mappings of an 
>>>> external object
>>>> the corresponding driver potentially creates. This could become a 
>>>> pretty large
>>>> list to iterate. Another reason was, that I want to keep the 
>>>> drm_gpuva structure
>>>> as small as possible, hence avoiding another list_head.
>>> Yes, the list might be pretty large, but OTOH you never iterate to 
>>> access a
>>> single list element. When you need to iterate the whole list you 
>>> need to do
>>> that regardless of the data structure used. As for the list head, it 
>>> might
>>> perhaps be aliased (union) with an upcoming userptr list head?
>>>
>> Oh, I did not mean that I'm concerned about the size of a list of 
>> extobjs in
>> general; that would indeed be the same for every data structure 
>> chosen. But I
>> would be concerned about keeping a list of *all* mappings being 
>> backed by an
>> extobj.
>>
>>>> Now, it sounds like in XE you're doing some kind of optimization 
>>>> just keeping a
>>>> single mapping of an extobj in the list? How do you know when to 
>>>> remove it? What
>>>> if the mapping from the extobj list gets unmapped, but there is 
>>>> still another
>>>> one left in the GPU-VM being backed by the same BO?
>>> When removing from the lists, we iterate through the object's list 
>>> of vmas,
>>> and if there is one matching the same vm, we replace the old one 
>>> with the
>>> new one. A similar iteration is done when adding to avoid adding one 
>>> that is
>>> already on the list.
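
As a purely illustrative sketch (hypothetical types and names, not Xe's actual
code), the scheme described above might look roughly like this, with the
per-VM extobj list protected by the outer GPU-VM lock and extobj_link
initialized with INIT_LIST_HEAD() at vma creation:

#include <linux/list.h>

struct sketch_vm {
	struct list_head extobj_list;	/* one vma per external GEM object,
					 * protected by the GPU-VM lock */
};

struct sketch_bo {
	struct list_head vma_list;	/* all vmas mapping this object */
};

struct sketch_vma {
	struct sketch_vm *vm;
	struct sketch_bo *bo;
	struct list_head bo_link;	/* entry in bo->vma_list */
	struct list_head extobj_link;	/* entry in vm->extobj_list, or empty */
};

/* Called with vm's GPU-VM lock held when vma maps an external object. */
static void sketch_extobj_add(struct sketch_vm *vm, struct sketch_vma *vma)
{
	struct sketch_vma *tmp;

	/* O(m), m == number of vmas bound to this bo, typically small. */
	list_for_each_entry(tmp, &vma->bo->vma_list, bo_link) {
		if (tmp->vm == vm && !list_empty(&tmp->extobj_link))
			return;	/* this (bo, vm) pair is already tracked */
	}

	list_add_tail(&vma->extobj_link, &vm->extobj_list);
}

/* Called with vm's GPU-VM lock held before vma is removed. */
static void sketch_extobj_del(struct sketch_vm *vm, struct sketch_vma *vma)
{
	struct sketch_vma *tmp;

	if (list_empty(&vma->extobj_link))
		return;

	list_del_init(&vma->extobj_link);

	/* Hand the slot over to another vma of the same vm, if one exists. */
	list_for_each_entry(tmp, &vma->bo->vma_list, bo_link) {
		if (tmp != vma && tmp->vm == vm) {
			list_add_tail(&tmp->extobj_link, &vm->extobj_list);
			break;
		}
	}
}

This keeps the extobj list at one entry per (external GEM, VM) pair, at the
cost of the O(m) walk on add/remove that the thread discusses.
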
>> I see, but wouldn't this be O(n) on insertion and O(m) on removal of 
>> an extobj,
>> while using the maple tree is O(log(n))?
>
> No, insertion and removal are O(m) where m is the number of vms the 
> object is currently bound to. Typically a very small number.
>
>>
>>>> Although assuming that's a no-go for GPUVA wouldn't an XArray be a 
>>>> better
>>>> choice, keeping O(1)?
>>>> When tracking extobjs, the address of the drm_gem_object is the key 
>>>> while the
>>>> reference count is the value. I was thinking of an XArray as well, 
>>>> but I was
>>>> worried that the corresponding indices could be too sparsely 
>>>> distributed for an
>>>> XArray to still be efficient. Now that I think about it, it's 
>>>> probably not that
>>>> bad.
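
As a purely illustrative sketch of the XArray variant mentioned here (not code
from the patch), keying on the GEM object's address and storing the mapping
count as an XArray value entry could look roughly like this; callers are
assumed to serialize get/put against each other for the same VM:

#include <linux/xarray.h>
#include <drm/drm_gem.h>

/* Bump the mapping count of an external object, inserting it on first use. */
static int sketch_extobj_get(struct xarray *xa, struct drm_gem_object *obj)
{
	/* pointer-valued index: sparse, but workable per the discussion above */
	unsigned long key = (unsigned long)obj;
	void *entry = xa_load(xa, key);
	unsigned long cnt = entry ? xa_to_value(entry) : 0;

	/* xa_store() returns the previous entry or an xa_err() pointer. */
	entry = xa_store(xa, key, xa_mk_value(cnt + 1), GFP_KERNEL);
	return xa_err(entry);
}

/* Drop one mapping; remove the object once the last mapping goes away. */
static void sketch_extobj_put(struct xarray *xa, struct drm_gem_object *obj)
{
	unsigned long key = (unsigned long)obj;
	void *entry = xa_load(xa, key);

	if (!entry)	/* unbalanced put */
		return;

	if (xa_to_value(entry) > 1)
		xa_store(xa, key, xa_mk_value(xa_to_value(entry) - 1), GFP_KERNEL);
	else
		xa_erase(xa, key);
}
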
>>>>
>>>> Btw., while I agree with trying to make things as efficient as possible, 
>>>> what is the
>>>> expected magnitude of extobjs to be tracked, do we need to worry about the 
>>>> O(log(n))?
>>> Not sure yet, TBH, but I think one of our UMDs can only use external 
>>> objects,
>>> because they don't know at creation time which ones need exporting. 
>>> However
>>> if this turns out to be too bad, there are various flavours of 
>>> "clever but
>>> complicated" optimizations that we could think of to reduce the list 
>>> size.
>>> Still in our case, we opted for the vma list head for now.
>> Considering the above, I would guess that if your current approach is 
>> good
>> enough, a maple tree will work as well.
>
> Hmm, yeah, it's probably a bikeshed since each drm_exec builds a 
> reallocated array of all external objects on each exec.
>
>>
>> Otherwise, if you want, I could do some experiments with Xarray and 
>> see how
>> that works out compared to using a maple tree.
>>
>> Btw. another nice thing about using an XArray or maple tree for that is 
>> that
>> drivers updating the VA space from the fence signalling path don't 
>> need to
>> hold a GPU-VM lock to update the extobj list. Actually, they might 
>> not need
>> a GPU-VM lock at all.
>
> I still don't follow why drivers would want to do that. Isn't the VA 
> space / fence object list always updated synchronously from the IOCTL?

meaning external object list ofc. :)

/Thomas


>
> /Thomas
>
>
>>
>>> /Thomas
>>>
>>>
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @evict: structure holding the evict list and evict list 
>>>>>>>> lock
>>>>>>>> +     */
>>>>>>>> +    struct {
>>>>>>>> +        /**
>>>>>>>> +         * @list: &list_head storing &drm_gem_objects 
>>>>>>>> currently being
>>>>>>>> +         * evicted
>>>>>>>> +         */
>>>>>>>> +        struct list_head list;
>>>>>>>> +
>>>>>>>> +        /**
>>>>>>>> +         * @lock: spinlock to protect the evict list against 
>>>>>>>> concurrent
>>>>>>>> +         * insertion / removal of different &drm_gpuva_gems
>>>>>>>> +         */
>>>>>>>> +        spinlock_t lock;
>>>>>>>> +    } evict;
>>>>>>>>      };
>>>>>>>>      void drm_gpuva_manager_init(struct drm_gpuva_manager *mgr,
>>>>>>>> +                struct drm_device *drm,
>>>>>>>>                      const char *name,
>>>>>>>>                      u64 start_offset, u64 range,
>>>>>>>>                      u64 reserve_offset, u64 reserve_range,
>>>>>>>>                      const struct drm_gpuva_fn_ops *ops);
>>>>>>>>      void drm_gpuva_manager_destroy(struct drm_gpuva_manager 
>>>>>>>> *mgr);
>>>>>>>> +/**
>>>>>>>> + * DRM_GPUVA_EXEC - returns the &drm_gpuva_managers &drm_exec 
>>>>>>>> instance
>>>>>>>> + * @mgr: the &drm_gpuva_managers to return the &drm_exec 
>>>>>>>> instance for
>>>>>>>> + */
>>>>>>>> +#define DRM_GPUVA_EXEC(mgr)    &(mgr)->exec
>>>>>>> A struct ww_acquire_ctx and thus a drm_exec is fundamentally per 
>>>>>>> task and
>>>>>>> should typically be allocated on the stack. Otherwise you'd need 
>>>>>>> to protect
>>>>>>> the mgr->exec member with an exclusive lock throughout the 
>>>>>>> locking process,
>>>>>>> and that's not what we want.
>>>>>> Oh, good point. I think it works in Nouveau, because there it's 
>>>>>> implicitly
>>>>>> protected with the job submission lock.
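
For reference, a minimal sketch of the stack-allocated pattern Thomas is
arguing for, using the drm_exec API as merged around the time of this series
(exact signatures may differ in other kernel versions); the helper name and
parameters are hypothetical:

#include <drm/drm_exec.h>
#include <drm/drm_gem.h>

static int sketch_lock_objects(struct drm_gem_object **objs,
			       unsigned int num_objs,
			       unsigned int num_fences)
{
	struct drm_exec exec;	/* per-task state, so it lives on the stack */
	unsigned int i;
	int ret = 0;

	drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT);
	drm_exec_until_all_locked(&exec) {
		for (i = 0; i < num_objs; i++) {
			ret = drm_exec_prepare_obj(&exec, objs[i], num_fences);
			drm_exec_retry_on_contention(&exec);
			if (ret)
				goto out;
		}
	}

	/* ... objects are locked and have fence slots reserved here ... */

out:
	drm_exec_fini(&exec);
	return ret;
}
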
>>>>>>
>>>>>>> Did you consider subclassing a drm_exec for drm_gpuva purposes 
>>>>>>> and adding
>>>>>>> the needed ops to it, like so:
>>>>>> That's a good idea, will take this into V2.
>>>>> Actually, I'm not fully sure that was a good idea: I now have a 
>>>>> working
>>>>> version of Xe ported over to drm_exec, having these helpers in 
>>>>> mind and with
>>>>> the intention to start using them as they mature. What I found, 
>>>>> though is
>>>>> that open-coding the drm_exec loop is not all that bad, but that 
>>>>> building
>>>>> blocks that can be called from within the loop are useful:
>>>>>
>>>>> Like the drm_gpuva_prepare_objects() and an imaginary
>>>>> drm_gpuva_prepare_gpuva() that locks the vm resv and the resv of 
>>>>> the object
>>>>> (if different and the gpuva points to the object). And
>>>>> drm_gpuva_prepare_array() although we don't use it within Xe. That 
>>>>> means you
>>>>> can use these building blocks like helpers and avoid the fn() 
>>>>> callback by
>>>>> instead open-coding.
>>>>>
>>>>> But I guess YMMV.
>>>> That's exactly why those building blocks are exported; I already 
>>>> had in mind
>>>> that there might be drivers which still want to open-code the 
>>>> drm_exec loop,
>>>> while others might just want a simple interface to lock everything.
>>>>
>>>> I still think it is a good idea, but I'd keep that as simple as 
>>>> possible. And
>>>> for everything else just let the driver open-code it and use the 
>>>> "building
>>>> blocks" - will also expand the bulding blocks to what you mentioned 
>>>> above.
>>>>
>>>>>>> struct drm_gpuva_exec_ops {
>>>>>>>        int (*fn) (struct drm_gpuva_exec *exec, int num_fences);
>>>>>> Is this the fn argument from drm_gpuva_manager_lock_extra()?
>>>>>>
>>>>>>>        int (*bo_validate) (struct drm_gpuva_exec *exec, struct 
>>>>>>> drm_gem_object
>>>>>>> *obj);
>>>>>> I guess we could also keep that within the drm_gpuva_fn_ops? This 
>>>>>> should always
>>>>>> be the same callback, right?
>>>>>>
>>>>>>> };
>>>>>>>
>>>>>>> struct drm_gpuva_exec {
>>>>>>>        const struct drm_gpuva_exec_ops *ops;
>>>>>>>        struct drm_exec exec;
>>>>>>>        struct drm_gpuva_manager *mgr;
>>>>>>> };
>>>>>>>
>>>>>>> Although I'd actually expect bo_validate to be part of fn in the 
>>>>>>> typical
>>>>>>> case. The drm_gpuva_exec would then be allocated by the caller 
>>>>>>> on the stack.
>>>>>> This doesn't sound like my assumption about fn() above is correct.
>>>>> Well, one important thing in our conversion is that ttm_bo_validate()
>>>>> needs
>>>>> to be in the until_all_locked() loop. We soon want to be able to use
>>>>> sleeping locks for eviction, so a xe_bo_validate() would, at least
>>>>> temporarily, add locked objects to the drm_exec list of locked 
>>>>> objects. That
>>>>> means everything that may end up calling validate deep within the 
>>>>> call chain
>>>>> needs to be part of the until_all_locked() loop, so our
>>>>> drm_gpuva_manager_lock_extra() fn callback would include those 
>>>>> validates and
>>>>> look different all the time. That's why open-coding isn't 
>>>>> all that
>>>>> bad...
>>>> Oh, I see. You indeed want to call validate() from within 
>>>> until_all_locked().
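
To make that concrete, a purely illustrative sketch of validation inside the
loop: driver_bo_validate() is a hypothetical stand-in for something like
xe_bo_validate()/ttm_bo_validate(), and the caller is assumed to own
drm_exec_init()/drm_exec_fini() in the building-block spirit discussed above:

#include <drm/drm_exec.h>
#include <drm/drm_gem.h>

/* Hypothetical driver hook; may lock and evict further objects under the
 * same ww_acquire_ctx, and may therefore hit contention. */
int driver_bo_validate(struct drm_exec *exec, struct drm_gem_object *obj);

static int sketch_lock_and_validate(struct drm_exec *exec,
				    struct drm_gem_object **objs,
				    unsigned int num_objs)
{
	unsigned int i;
	int ret = 0;

	drm_exec_until_all_locked(exec) {
		for (i = 0; i < num_objs; i++) {
			ret = drm_exec_prepare_obj(exec, objs[i], 1);
			drm_exec_retry_on_contention(exec);
			if (ret)
				return ret;

			/* Validation happens inside the loop, so that any
			 * contention it causes restarts the locking. */
			ret = driver_bo_validate(exec, objs[i]);
			drm_exec_retry_on_contention(exec);
			if (ret)
				return ret;
		}
	}

	return ret;
}
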
>>>>
>>>>> /Thomas
>>>>>
>>>>>
>> <snip>

