[Intel-gfx] [PATCH 3/3] drm/doc/rfc: VM_BIND uapi definition

Tue Jun 14 16:07:37 UTC 2022

On 14/06/2022 17:02, Tvrtko Ursulin wrote:
> 
> On 14/06/2022 16:43, Niranjana Vishwanathapura wrote:
>> On Tue, Jun 14, 2022 at 08:16:41AM +0100, Tvrtko Ursulin wrote:
>>>
>>> On 14/06/2022 00:39, Matthew Brost wrote:
>>>> On Mon, Jun 13, 2022 at 07:09:06PM +0100, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 13/06/2022 18:49, Niranjana Vishwanathapura wrote:
>>>>>> On Mon, Jun 13, 2022 at 05:22:02PM +0100, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> On 13/06/2022 16:05, Niranjana Vishwanathapura wrote:
>>>>>>>> On Mon, Jun 13, 2022 at 09:24:18AM +0100, Tvrtko Ursulin wrote:
>>>>>>>>>
>>>>>>>>> On 10/06/2022 17:14, Niranjana Vishwanathapura wrote:
>>>>>>>>>> On Fri, Jun 10, 2022 at 05:48:39PM +0300, Lionel Landwerlin 
>>>>>>>>>> wrote:
>>>>>>>>>>> On 10/06/2022 13:37, Tvrtko Ursulin wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/06/2022 08:07, Niranjana Vishwanathapura wrote:
>>>>>>>>>>>>> VM_BIND and related uapi definitions
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Niranjana Vishwanathapura
>>>>>>>>>>>>> <niranjana.vishwanathapura at intel.com>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>   Documentation/gpu/rfc/i915_vm_bind.h | 490
>>>>>>>>>>>>> +++++++++++++++++++++++++++
>>>>>>>>>>>>>   1 file changed, 490 insertions(+)
>>>>>>>>>>>>>   create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git
>>>>>>>>>>>>> a/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>>>>> b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>>>>> new file mode 100644
>>>>>>>>>>>>> index 000000000000..9fc854969cfb
>>>>>>>>>>>>> --- /dev/null
>>>>>>>>>>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h
>>>>>>>>>>>>> @@ -0,0 +1,490 @@
>>>>>>>>>>>>> +/* SPDX-License-Identifier: MIT */
>>>>>>>>>>>>> +/*
>>>>>>>>>>>>> + * Copyright © 2022 Intel Corporation
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +/**
>>>>>>>>>>>>> + * DOC: I915_PARAM_HAS_VM_BIND
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * VM_BIND feature availability.
>>>>>>>>>>>>> + * See typedef drm_i915_getparam_t param.
>>>>>>>>>>>>> + * bit[0]: If set, VM_BIND is supported, otherwise not.
>>>>>>>>>>>>> + * bits[8-15]: VM_BIND implementation version.
>>>>>>>>>>>>> + * version 0 will not have VM_BIND/UNBIND
>>>>>>>>>>>>> timeline fence array support.
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> +#define I915_PARAM_HAS_VM_BIND        57
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +/**
>>>>>>>>>>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * Flag to opt-in for VM_BIND mode of binding during VM 
>>>>>>>>>>>>> creation.
>>>>>>>>>>>>> + * See struct drm_i915_gem_vm_control flags.
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * The older execbuf2 ioctl will not
>>>>>>>>>>>>> support VM_BIND mode of operation.
>>>>>>>>>>>>> + * For VM_BIND mode, we have new execbuf3
>>>>>>>>>>>>> ioctl which will not accept any
>>>>>>>>>>>>> + * execlist (See struct
>>>>>>>>>>>>> drm_i915_gem_execbuffer3 for more details).
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> +#define I915_VM_CREATE_FLAGS_USE_VM_BIND    (1 << 0)
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +/**
>>>>>>>>>>>>> + * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * Flag to declare context as long running.
>>>>>>>>>>>>> + * See struct drm_i915_gem_context_create_ext flags.
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * Usage of dma-fence expects that they
>>>>>>>>>>>>> complete in reasonable amount of time.
>>>>>>>>>>>>> + * Compute on the other hand can be long
>>>>>>>>>>>>> running. Hence it is not appropriate
>>>>>>>>>>>>> + * for compute contexts to export request
>>>>>>>>>>>>> completion dma-fence to user.
>>>>>>>>>>>>> + * The dma-fence usage will be limited to
>>>>>>>>>>>>> in-kernel consumption only.
>>>>>>>>>>>>> + * Compute contexts need to use user/memory fence.
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * So, long running contexts do not support output fences. 
>>>>>>>>>>>>> Hence,
>>>>>>>>>>>>> + * I915_EXEC_FENCE_SIGNAL (See
>>>>>>>>>>>>> &drm_i915_gem_exec_fence.flags) is expected
>>>>>>>>>>>>> + * to be not used. DRM_I915_GEM_WAIT ioctl
>>>>>>>>>>>>> call is also not supported for
>>>>>>>>>>>>> + * objects mapped to long running contexts.
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING   (1u << 2)
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +/* VM_BIND related ioctls */
>>>>>>>>>>>>> +#define DRM_I915_GEM_VM_BIND        0x3d
>>>>>>>>>>>>> +#define DRM_I915_GEM_VM_UNBIND        0x3e
>>>>>>>>>>>>> +#define DRM_I915_GEM_EXECBUFFER3    0x3f
>>>>>>>>>>>>> +#define DRM_I915_GEM_WAIT_USER_FENCE    0x40
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +#define DRM_IOCTL_I915_GEM_VM_BIND
>>>>>>>>>>>>> DRM_IOWR(DRM_COMMAND_BASE +
>>>>>>>>>>>>> DRM_I915_GEM_VM_BIND, struct
>>>>>>>>>>>>> drm_i915_gem_vm_bind)
>>>>>>>>>>>>> +#define DRM_IOCTL_I915_GEM_VM_UNBIND
>>>>>>>>>>>>> DRM_IOWR(DRM_COMMAND_BASE +
>>>>>>>>>>>>> DRM_I915_GEM_VM_UNBIND, struct
>>>>>>>>>>>>> drm_i915_gem_vm_bind)
>>>>>>>>>>>>> +#define DRM_IOCTL_I915_GEM_EXECBUFFER3
>>>>>>>>>>>>> DRM_IOWR(DRM_COMMAND_BASE +
>>>>>>>>>>>>> DRM_I915_GEM_EXECBUFFER3, struct
>>>>>>>>>>>>> drm_i915_gem_execbuffer3)
>>>>>>>>>>>>> +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE
>>>>>>>>>>>>> DRM_IOWR(DRM_COMMAND_BASE +
>>>>>>>>>>>>> DRM_I915_GEM_WAIT_USER_FENCE, struct
>>>>>>>>>>>>> drm_i915_gem_wait_user_fence)
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +/**
>>>>>>>>>>>>> + * struct drm_i915_gem_vm_bind - VA to object mapping to 
>>>>>>>>>>>>> bind.
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * This structure is passed to VM_BIND
>>>>>>>>>>>>> ioctl and specifies the mapping of GPU
>>>>>>>>>>>>> + * virtual address (VA) range to the
>>>>>>>>>>>>> section of an object that should be bound
>>>>>>>>>>>>> + * in the device page table of the specified address space 
>>>>>>>>>>>>> (VM).
>>>>>>>>>>>>> + * The VA range specified must be unique
>>>>>>>>>>>>> (ie., not currently bound) and can
>>>>>>>>>>>>> + * be mapped to whole object or a section
>>>>>>>>>>>>> of the object (partial binding).
>>>>>>>>>>>>> + * Multiple VA mappings can be created to
>>>>>>>>>>>>> the same section of the object
>>>>>>>>>>>>> + * (aliasing).
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * The @queue_idx specifies the queue to
>>>>>>>>>>>>> use for binding. Same queue can be
>>>>>>>>>>>>> + * used for both VM_BIND and VM_UNBIND
>>>>>>>>>>>>> calls. All submitted bind and unbind
>>>>>>>>>>>>> + * operations in a queue are performed in the order of 
>>>>>>>>>>>>> submission.
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * The @start, @offset and @length should
>>>>>>>>>>>>> be 4K page aligned. However the DG2
>>>>>>>>>>>>> + * and XEHPSDV has 64K page size for device
>>>>>>>>>>>>> local-memory and has compact page
>>>>>>>>>>>>> + * table. On those platforms, for binding
>>>>>>>>>>>>> device local-memory objects, the
>>>>>>>>>>>>> + * @start should be 2M aligned, @offset and
>>>>>>>>>>>>> @length should be 64K aligned.
>>>>>>>>>>>>> + * Also, on those platforms, it is not
>>>>>>>>>>>>> allowed to bind an device local-memory
>>>>>>>>>>>>> + * object and a system memory object in a
>>>>>>>>>>>>> single 2M section of VA range.
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> +struct drm_i915_gem_vm_bind {
>>>>>>>>>>>>> +    /** @vm_id: VM (address space) id to bind */
>>>>>>>>>>>>> +    __u32 vm_id;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +    /** @queue_idx: Index of queue for binding */
>>>>>>>>>>>>> +    __u32 queue_idx;
>>>>>>>>>>>>
>>>>>>>>>>>> I have a question here to which I did not find
>>>>>>>>>>>> an answer by browsing the old threads.
>>>>>>>>>>>>
>>>>>>>>>>>> Queue index appears to be an implicit
>>>>>>>>>>>> synchronisation mechanism, right? Operations on
>>>>>>>>>>>> the same index are executed/complete in order of
>>>>>>>>>>>> ioctl submission?
>>>>>>>>>>>>
>>>>>>>>>>>> Do we _have_ to implement this on the kernel
>>>>>>>>>>>> side and could just allow in/out fence and let
>>>>>>>>>>>> userspace deal with it?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It orders operations like in a queue. Which is kind
>>>>>>>>>>> of what happens with existing queues/engines.
>>>>>>>>>>>
>>>>>>>>>>> If I understood correctly, it's going to be a
>>>>>>>>>>> kthread + a linked list right?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, that is correct.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Lionel
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Arbitrary/on-demand number of queues will add
>>>>>>>>>>>> the complexity on the kernel side which should
>>>>>>>>>>>> be avoided if possible.
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> It was discussed in the other thread. Jason prefers this over 
>>>>>>>>>> putting
>>>>>>>>>> an artificial limit on number of queues (as user can
>>>>>>>>>> anyway can exhaust
>>>>>>>>>> the memory). I think complexity in the driver is manageable.
>>>>>>>>>
>>>>>>>>> You'll need to create tracking structures on demand, with
>>>>>>>>> atomic replace of last fence, ref counting and locking of
>>>>>>>>> some sort, more or less?
>>>>>>>>>
>>>>>>>>
>>>>>>>> We will have a workqueue, an work item and a linked list per queue.
>>>>>>>> VM_BIND/UNBIND call will add the mapping request to the
>>>>>>>> specified queue's
>>>>>>>> linked list and schedule the work item on the workqueue of that 
>>>>>>>> queue.
>>>>>>>> I am not sure what you mean by last fence and replacing it.
>>>>>>>>
>>>>>>>>>> The other option being discussed in to have the user create those
>>>>>>>>>> queues (like creating engine map) before hand and use that in 
>>>>>>>>>> vm_bind
>>>>>>>>>> and vm_unbind ioctls. This puts a limit on the number of queues.
>>>>>>>>>> But it is not clean either and not sure it is worth
>>>>>>>>>> making the interface
>>>>>>>>>> more complex.
>>>>>>>>>> https://www.spinics.net/lists/dri-devel/msg350448.html
>>>>>>>>>
>>>>>>>>> What about the third option of a flag to return a fence (of
>>>>>>>>> some sort) and pass in a fence? That way userspace can
>>>>>>>>> imagine zero or N queues with very little effort on the
>>>>>>>>> kernel side. Was this considered?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I am not clear what fence you are talking about here and how 
>>>>>>>> does that
>>>>>>>> help with the number of vm_bind queues. Can you eloborate?
>>>>>>>
>>>>>>> It is actually already documented that bind/unbind will support
>>>>>>> input and output fences - so what are these queues on top of what
>>>>>>> userspace can already achieve by using them? Purely a convenience or
>>>>>>> there is more to it?
>>>>>>>
>>>>>>
>>>>>> Oh, the vm_bind queues are discussed in this thread.
>>>>>> https://lists.freedesktop.org/archives/intel-gfx/2022-June/299217.html 
>>>>>>
>>>>>>
>>>>>> Apparently Vulkan has requirement for multiple queues, each queue
>>>>>> processing vm_bind/unbind calls in the order of submission.
>>>>>
>>>>> I don't see how that answers my question so I will take the freedom to
>>>>> repeat it. What are these queues on top of what userspace can already
>>>>> achieve by using in-out fences? Purely a convenience or there is 
>>>>> more to it?
>>>>>
>>>>> Queue1:
>>>>>
>>>>> out_fence_A = vm_bind A
>>>>> out_fence_B = vm_bind B, in_fence=out_fence_A
>>>>> execbuf(in_fence = out_fence_B)
>>>>>
>>>>> Queue2:
>>>>>
>>>>> out_fence_C = vm_bind C
>>>>> out_fence_D = vm_bind D, in_fence=out_fence_C
>>>>> execbuf(in_fence = out_fence_D)
>>>>>
>>>>> Parallel bind:
>>>>> out_fence_E = vm_bind E
>>>>> out_fence_F = vm_bind F
>>>>> merged_fence = fence_merge(out_fence_E, out_fence_F)
>>>>> execbuf(in_fence = merged_fence)
>>>>>
>>>>
>>>> Let's say you do this and only 1 queue:
>>>>
>>>> VM_BIND_A (in_fence=fence_A)
>>>> VM_BIND_B (in_fence=NULL)
>>>>
>>>> With 1 queue VM_BIND_B in blocked on fence_A, hence the need for than 1
>>>> queue.
>>>
>>> I don't follow - there isn't a concept of a queue exposed in uapi in 
>>> what I have described so the above two run in parallel there, if we 
>>> ignore fence_A in your example doesn't even exist before you pass it 
>>> to bind A so something is not right.
>>>
>>>> e.g.
>>>> VM_BIND_A (queue_id=0, in_fence=fence_A)
>>>> VM_BIND_B (queue_id=1, in_fence=NULL)
>>>>
>>>> Now VM_BIND_B can immediately be executed regardless of fence_A status.
>>>
>>> In my examples userspace can serialise or not as it sees fit using 
>>> fences. The "parallel bind" examples two binds run in parallel. 
>>> Userspace can create multiple such parallel "queues" if it wanted.
>>>
>>> Parallel bind 1 and 2 interleaved:
>>> out_fence_A = vm_bind A
>>> out_fence_B = vm_bind B
>>> out_fence_C = vm_bind C
>>> out_fence_D = vm_bind D
>>> // all binds can run in parallel
>>> merged_fence_1 = fence_merge(out_fence_A, out_fence_B)
>>> merged_fence_2 = fence_merge(out_fence_C, out_fence_D)
>>> execbuf(in_fence = merged_fence_1) // after A&B to finish
>>> execbuf(in_fence = merged_fence_2) // after C&D finish
>>>
>>> There is a huge disconnect somewhere but I don't know where.
>>>
>>
>> Note that Vulkan has requirement that VM_BIND and VM_UNBIND
>> operations will also have 'in' fences associated with them
>> and not just the 'out' fences (which your example above shows).
> 
> I gave more examples earlier:
> 
> """
> Queue1:
> 
> out_fence_A = vm_bind A
> out_fence_B = vm_bind B, in_fence=out_fence_A
> execbuf(in_fence = out_fence_B)
> """
> 
> Clearly I showed both in and out fence.
> 
>> Yes, one of the solution discussed was not to have any queue_idx
>> at all (assume single queue) and let the vm_bind/unbind operations
>> submitted run and complete out of submission order. That way
>> a vm_bind/unbind sumitted later will not be blocked by a vm_bind/unbind
>> submitted earlier.
>> But removing the ordering here comes at a cost. Having the operations
>> run in submission order has some benefits. These are discussed in the
>> other thread.
>> https://lists.freedesktop.org/archives/intel-gfx/2022-June/299217.html
> 
> That is some messed up deep quoting in that link. Could you please 
> summarize the cost which queues in the uapi intended to avoid?
> 
> In any case it is not just for me. A significant addition is proposed 
> for the driver so there should be a clear summary of cost vs benefit 
> rather than a messy thread.
> 
>> Hence having multiple queues gives the benefit of both worlds and gives
>> user more options.
> 
> Maybe, but lets be specific.

Also, it can't really give the benefit of both worlds. You can't go 
fully async with the queue_idx scheme without i915 having to create N 
internal queues (where N = number of bind operations submitted).

Therefore I suspect it's a trade-off between cost and convenience but 
I'd like things clearly summarized so decision can be made.

Regards,

Tvrtko