[Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
Niranjana Vishwanathapura
niranjana.vishwanathapura at intel.com
Tue Jun 7 21:32:10 UTC 2022
On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>> <niranjana.vishwanathapura at intel.com> wrote:
>>
>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>> > On 02/06/2022 23:35, Jason Ekstrand wrote:
>> >
>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>> > <niranjana.vishwanathapura at intel.com> wrote:
>> >
>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>> > >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding
>> > >> > +the mapping in an async worker. The binding and unbinding will
>> > >> > +work like a special GPU engine. The binding and unbinding
>> > >> > +operations are serialized and will wait on specified input
>> > >> > +fences before the operation and will signal the output fences
>> > >> > +upon the completion of the operation. Due to serialization,
>> > >> > +completion of an operation will also indicate that all previous
>> > >> > +operations are also complete.
>> > >>
>> > >> I guess we should avoid saying "will immediately start
>> > >> binding/unbinding" if there are fences involved.
>> > >>
>> > >> And the fact that it's happening in an async worker seems to
>> > >> imply it's not immediate.
>> > >>
>> >
>> > Ok, will fix.
>> > This was added because in the earlier design binding was deferred
>> > until the next execbuf. But now it is non-deferred (immediate in
>> > that sense). But yah, this is confusing and I will fix it.
>> >
>> > >>
>> > >> I have a question on the behavior of the bind operation when no
>> > >> input fence is provided. Let's say I do:
>> > >>
>> > >> VM_BIND (out_fence=fence1)
>> > >>
>> > >> VM_BIND (out_fence=fence2)
>> > >>
>> > >> VM_BIND (out_fence=fence3)
>> > >>
>> > >> In what order are the fences going to be signaled?
>> > >>
>> > >> In the order of VM_BIND ioctls? Or out of order?
>> > >>
>> > >> Because you wrote "serialized", I assume it's in order.
>> > >>
>> >
>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
>> > unbind will use the same queue and hence are ordered.
>> >
>> > >>
>> > >> One thing I didn't realize is that because we only get one
>> > >> "VM_BIND" engine, there is a disconnect from the Vulkan
>> > >> specification.
>> > >>
>> > >> In Vulkan, VM_BIND operations are serialized but per engine.
>> > >>
>> > >> So you could have something like this:
>> > >>
>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>> > >>
>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>> > >>
>> > >> fence1 is not signaled
>> > >>
>> > >> fence3 is signaled
>> > >>
>> > >> So the second VM_BIND will proceed before the first VM_BIND.
>> > >>
>> > >> I guess we can deal with that scenario in userspace by doing the
>> > >> wait ourselves in one thread per engine.
>> > >>
>> > >> But then it makes the VM_BIND input fences useless.
>> > >>
>> > >> Daniel: what do you think? Should we rework this or just deal
>> > >> with wait fences in userspace?
>> > >>
>> > >
>> > >My opinion is to rework this but make the ordering via an engine
>> > >param optional.
>> > >
>> > >e.g. A VM can be configured so all binds are ordered within the VM.
>> > >
>> > >e.g. A VM can be configured so all binds accept an engine argument
>> > >(in the case of the i915 this is likely a gem context handle) and
>> > >binds are ordered with respect to that engine.
>> > >
>> > >This gives UMDs options, as the latter likely consumes more KMD
>> > >resources, so if a different UMD can live with binds being ordered
>> > >within the VM, it can use a mode consuming fewer resources.
>> > >
>> >
>> > I think we need to be careful here if we are looking for some
>> > out-of-(submission)-order completion of vm_bind/unbind.
>> > In-order completion means that, in a batch of binds and unbinds to
>> > be completed in order, the user only needs to specify an in-fence
>> > for the first bind/unbind call and an out-fence for the last
>> > bind/unbind call. Also, the VA released by an unbind call can be
>> > re-used by any subsequent bind call in that in-order batch.
>> >
>> > These things will break if binding/unbinding were allowed to go
>> > out of order (of submission), and the user needs to be extra
>> > careful not to run into premature triggering of the out-fence, the
>> > bind failing because the VA is still in use, etc.
>> >
>> > Also, VM_BIND binds the provided mapping on the specified address
>> > space (VM). So, the uapi is not engine/context specific.
>> >
>> > We can however add a 'queue' to the uapi which can be one of the
>> > pre-defined queues,
>> > I915_VM_BIND_QUEUE_0
>> > I915_VM_BIND_QUEUE_1
>> > ...
>> > I915_VM_BIND_QUEUE_(N-1)
>> >
>> > KMD will spawn an async work queue for each queue which will only
>> > bind the mappings on that queue in the order of submission.
>> > The user can assign a queue per engine or anything like that.
>> >
>> > But again here, the user needs to be careful not to deadlock these
>> > queues with a circular dependency of fences.
>> >
>> > I prefer adding this later as an extension based on whether it
>> > is really helping with the implementation.
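For illustration only, here is a minimal sketch of how the per-queue uapi
and in-order fencing proposed above could look. None of this is final
uapi; the struct and field names below are placeholders invented for the
example:

#include <linux/types.h>

/* Pre-defined bind queues, as proposed above. */
#define I915_VM_BIND_QUEUE_0  0
#define I915_VM_BIND_QUEUE_1  1
/* ... up to I915_VM_BIND_QUEUE_(N-1) */

/* Hypothetical ioctl payload; only the fields discussed here are shown. */
struct example_vm_bind {
        __u32 vm_id;      /* address space (VM) to bind into */
        __u32 queue;      /* in-order bind queue to submit on */
        __u64 start;      /* GPU virtual address */
        __u64 length;     /* size of the mapping */
        __u32 in_fence;   /* syncobj to wait on before binding (0 = none) */
        __u32 out_fence;  /* syncobj to signal on completion (0 = none) */
};

/*
 * Because completion is in submission order within a queue, a batch of
 * three binds only needs an in-fence on the first and an out-fence on
 * the last:
 *
 *   bind(queue = 0, in_fence = f_in, out_fence = 0);
 *   bind(queue = 0, in_fence = 0,    out_fence = 0);
 *   bind(queue = 0, in_fence = 0,    out_fence = f_out);
 *
 * Binds submitted on a different queue are not ordered against these.
 */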
>> >
>> > I can tell you right now that having everything on a single
>> > in-order queue will not get us the perf we want. What Vulkan
>> > really wants is one of two things:
>> > 1. No implicit ordering of VM_BIND ops. They just happen whenever
>> > their dependencies are resolved and we ensure ordering ourselves
>> > by having a syncobj in the VkQueue.
>> > 2. The ability to create multiple VM_BIND queues. We need at
>> > least 2 but I don't see why there needs to be a limit besides the
>> > limits the i915 API already has on the number of engines. Vulkan
>> > could expose multiple sparse binding queues to the client if it's
>> > not arbitrarily limited.
>>
>> Thanks Jason, Lionel.
>>
>> Jason, what are you referring to when you say "limits the i915 API
>> already has on the number of engines"? I am not sure if there is
>> such an uapi today.
>>
>> There's a limit of something like 64 total engines today based on the
>> number of bits we can cram into the exec flags in execbuffer2. I think
>> someone had an extended version that allowed more but I ripped it out
>> because no one was using it. Of course, execbuffer3 might not have that
>> problem at all.
>>
>
>Thanks Jason.
>Ok, I am not sure which exec flag that is, but yah, execbuffer3 probably
>will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE
>and somehow export it to the user (I am thinking of embedding it in
>I915_PARAM_HAS_VM_BIND: bits[0]->HAS_VM_BIND, bits[1-3]->'n', meaning 2^n
>queues).
Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f), which execbuf3
will also have. So, we can simply define in the vm_bind/unbind structures,
#define I915_VM_BIND_MAX_QUEUE 64
__u32 queue;
I think that will keep things simple.
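For illustration only, a rough sketch of how userspace might query the
feature and clamp its queue index. I915_PARAM_HAS_VM_BIND and its param
number are still just a proposal in this thread, not merged uapi:

#include <sys/ioctl.h>
#include <xf86drm.h>
#include <i915_drm.h>

/* Proposed in this thread; the value 57 is only a placeholder. */
#ifndef I915_PARAM_HAS_VM_BIND
#define I915_PARAM_HAS_VM_BIND 57
#endif
#define I915_VM_BIND_MAX_QUEUE 64

/* Returns non-zero if the kernel advertises VM_BIND support. */
static int has_vm_bind(int drm_fd)
{
        int value = 0;
        struct drm_i915_getparam gp = {
                .param = I915_PARAM_HAS_VM_BIND,
                .value = &value,
        };

        /* drmIoctl() returns -1 (errno set) if the param is unknown. */
        if (drmIoctl(drm_fd, DRM_IOCTL_I915_GETPARAM, &gp))
                return 0;

        return value;
}

/* A UMD would then pick a queue, e.g.:
 *   bind.queue = sparse_queue_idx % I915_VM_BIND_MAX_QUEUE;
 */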
Niranjana
>
>> I am trying to see how many queues we need and don't want it to be
>> arbitrarily large and unduly blow up memory usage and complexity in
>> the i915 driver.
>>
>> I expect a Vulkan driver to use at most 2 in the vast majority of cases. I
>> could imagine a client wanting to create more than 1 sparse queue, in which
>> case it'll be N+1, but that's unlikely. As far as complexity goes, once
>> you allow two, I don't think the complexity goes up by allowing N. As
>> for memory usage, creating more queues means more memory. That's a
>> trade-off that userspace can make. Again, the expected number here is 1
>> or 2 in the vast majority of cases, so I don't think you need to worry.
>
>Ok, will start with n=3, meaning 8 queues.
>That would require us to create 8 workqueues.
>We can change 'n' later if required.
>
>Niranjana
>
>>
>> > Why? Because Vulkan has two basic kinds of bind operations and we
>> > don't want any dependencies between them:
>> > 1. Immediate. These happen right after BO creation or maybe as
>> > part of vkBindImageMemory() or vkBindBufferMemory(). These don't
>> > happen on a queue and we don't want them serialized with anything.
>> > To synchronize with submit, we'll have a syncobj in the VkDevice
>> > which is signaled by all immediate bind operations, and make
>> > submits wait on it.
>> > 2. Queued (sparse): These happen on a VkQueue which may be the
>> > same as a render/compute queue or may be its own queue. It's up
>> > to us what we want to advertise. From the Vulkan API PoV, this is
>> > like any other queue. Operations on it wait on and signal
>> > semaphores. If we have a VM_BIND engine, we'd provide syncobjs to
>> > wait and signal just like we do in execbuf().
>> > The important thing is that we don't want one type of operation to
>> > block on the other. If immediate binds are blocking on sparse
>> > binds, it's going to cause over-synchronization issues.
>> > In terms of the internal implementation, I know that there's going
>> > to be a lock on the VM and that we can't actually do these things
>> > in parallel. That's fine. Once the dma_fences have signaled and
>> > we're
>>
>> That's correct. It is like a single VM_BIND engine with multiple queues
>> feeding into it.
>>
>> Right. As long as the queues themselves are independent and can block on
>> dma_fences without holding up other queues, I think we're fine.
>>
>> > unblocked to do the bind operation, I don't care if there's a bit
>> > of synchronization due to locking. That's expected. What we
>> > can't afford to have is an immediate bind operation suddenly
>> > blocking on a sparse operation which is blocked on a compute job
>> > that's going to run for another 5ms.
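As a minimal sketch of the immediate-bind pattern described above (a
Mesa-side design idea from this thread, not existing uapi): a device-wide
timeline syncobj is signaled by every immediate bind and waited on before
every submit. The helper names here are invented for the example, and a
real driver would pass the syncobj as an in-fence to the submit rather
than CPU-waiting:

#include <stdint.h>
#include <xf86drm.h>

struct device_bind_sync {
        uint32_t syncobj;     /* device-wide timeline syncobj handle */
        uint64_t last_point;  /* last point handed out to an immediate bind */
};

static int device_bind_sync_init(int drm_fd, struct device_bind_sync *s)
{
        s->last_point = 0;
        return drmSyncobjCreate(drm_fd, 0, &s->syncobj);
}

/* Each immediate bind (vkBindImageMemory()/vkBindBufferMemory()) would
 * signal this point via its VM_BIND out-fence. */
static uint64_t device_bind_sync_next_point(struct device_bind_sync *s)
{
        return ++s->last_point;
}

/* Before a submit: wait for every immediate bind issued so far, without
 * ever touching the sparse (queued) bind path. */
static int device_bind_sync_wait(int drm_fd, struct device_bind_sync *s)
{
        uint64_t point = s->last_point;

        if (!point)
                return 0;

        return drmSyncobjTimelineWait(drm_fd, &s->syncobj, &point, 1,
                                      INT64_MAX,
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                                      NULL);
}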
>>
>> As the VM_BIND queues are per VM, a VM_BIND on one VM doesn't block a
>> VM_BIND on another VM. I am not sure about the use cases here, but
>> just wanted to clarify.
>>
>> Yes, that's what I would expect.
>> --Jason
>>
>> Niranjana
>>
>> > For reference, Windows solves this by allowing arbitrarily many
>> > paging queues (what they call a VM_BIND engine/queue). That design
>> > works pretty well and solves the problems in question. Again, we
>> > could just make everything out-of-order and require using syncobjs
>> > to order things as userspace wants. That'd be fine too.
>> > One more note while I'm here: danvet said something on IRC about
>> > VM_BIND queues waiting for syncobjs to materialize. We don't
>> > really want/need this. We already have all the machinery in
>> > userspace to handle wait-before-signal and waiting for syncobj
>> > fences to materialize, and that machinery is on by default. It
>> > would actually take MORE work in Mesa to turn it off and take
>> > advantage of the kernel being able to wait for syncobjs to
>> > materialize. Also, getting that right is ridiculously hard and I
>> > really don't want to get it wrong in kernel space. When we do
>> > memory fences, wait-before-signal will be a thing. We don't need
>> > to try and make it a thing for syncobj.
>> > --Jason
>> >
>> > Thanks Jason,
>> >
>> > I missed the bit in the Vulkan spec that says we're allowed to have
>> > a sparse queue that does not implement either graphics or compute
>> > operations:
>> >
>> > "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
>> > support in queue families that also include graphics and compute
>> > support, other implementations may only expose a
>> > VK_QUEUE_SPARSE_BINDING_BIT-only queue family."
>> >
>> > So it can all be a vm_bind engine that just does bind/unbind
>> > operations.
>> >
>> > But yes, we need another engine for the immediate/non-sparse
>> > operations.
>> >
>> > -Lionel
>> >
>> > >
>> > Daniel, any thoughts?
>> >
>> > Niranjana
>> >
>> > >Matt
>> > >
>> > >>
>> > >> Sorry I noticed this late.
>> > >>
>> > >>
>> > >> -Lionel
>> > >>
>> > >>