[Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
Jason Ekstrand
jason at jlekstrand.net
Wed Jun 8 21:55:38 UTC 2022
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura <
niranjana.vishwanathapura at intel.com> wrote:
> On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
> >
> >
> >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura
> wrote:
> >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
> >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
> >>>> <niranjana.vishwanathapura at intel.com> wrote:
> >>>>
> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
> >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote:
> >>>> >
> >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
> >>>> > <niranjana.vishwanathapura at intel.com> wrote:
> >>>> >
> >>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew
> >>>>Brost wrote:
> >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin
> >>>> wrote:
> >>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
> >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start
> >>>> binding/unbinding
> >>>> > the mapping in an
> >>>> > >> > +async worker. The binding and unbinding will
> >>>>work like a
> >>>> special
> >>>> > GPU engine.
> >>>> > >> > +The binding and unbinding operations are serialized
> and
> >>>> will
> >>>> > wait on specified
> >>>> > >> > +input fences before the operation and will signal the
> >>>> output
> >>>> > fences upon the
> >>>> > >> > +completion of the operation. Due to serialization,
> >>>> completion of
> >>>> > an operation
> >>>> > >> > +will also indicate that all previous operations
> >>>>are also
> >>>> > complete.
> >>>> > >>
> >>>> > >> I guess we should avoid saying "will immediately start
> >>>> > binding/unbinding" if
> >>>> > >> there are fences involved.
> >>>> > >>
> >>>> > >> And the fact that it's happening in an async
> >>>>worker seem to
> >>>> imply
> >>>> > it's not
> >>>> > >> immediate.
> >>>> > >>
> >>>> >
> >>>> > Ok, will fix.
> >>>> > This was added because in earlier design binding was
> deferred
> >>>> until
> >>>> > next execbuff.
> >>>> > But now it is non-deferred (immediate in that sense).
> >>>>But yah,
> >>>> this is
> >>>> > confusing
> >>>> > and will fix it.
> >>>> >
> >>>> > >>
> >>>> > >> I have a question on the behavior of the bind
> >>>>operation when
> >>>> no
> >>>> > input fence
> >>>> > >> is provided. Let say I do :
> >>>> > >>
> >>>> > >> VM_BIND (out_fence=fence1)
> >>>> > >>
> >>>> > >> VM_BIND (out_fence=fence2)
> >>>> > >>
> >>>> > >> VM_BIND (out_fence=fence3)
> >>>> > >>
> >>>> > >>
> >>>> > >> In what order are the fences going to be signaled?
> >>>> > >>
> >>>> > >> In the order of VM_BIND ioctls? Or out of order?
> >>>> > >>
> >>>> > >> Because you wrote "serialized I assume it's : in order
> >>>> > >>
> >>>> >
> >>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that
> >>>>bind and
> >>>> unbind
> >>>> > will use
> >>>> > the same queue and hence are ordered.
> >>>> >
> >>>> > >>
> >>>> > >> One thing I didn't realize is that because we only get
> one
> >>>> > "VM_BIND" engine,
> >>>> > >> there is a disconnect from the Vulkan specification.
> >>>> > >>
> >>>> > >> In Vulkan VM_BIND operations are serialized but
> >>>>per engine.
> >>>> > >>
> >>>> > >> So you could have something like this :
> >>>> > >>
> >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
> >>>> > >>
> >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
> >>>> > >>
> >>>> > >>
> >>>> > >> fence1 is not signaled
> >>>> > >>
> >>>> > >> fence3 is signaled
> >>>> > >>
> >>>> > >> So the second VM_BIND will proceed before the
> >>>>first VM_BIND.
> >>>> > >>
> >>>> > >>
> >>>> > >> I guess we can deal with that scenario in
> >>>>userspace by doing
> >>>> the
> >>>> > wait
> >>>> > >> ourselves in one thread per engines.
> >>>> > >>
> >>>> > >> But then it makes the VM_BIND input fences useless.
> >>>> > >>
> >>>> > >>
> >>>> > >> Daniel : what do you think? Should be rework this or just
> >>>> deal with
> >>>> > wait
> >>>> > >> fences in userspace?
> >>>> > >>
> >>>> > >
> >>>> > >My opinion is rework this but make the ordering via
> >>>>an engine
> >>>> param
> >>>> > optional.
> >>>> > >
> >>>> > >e.g. A VM can be configured so all binds are ordered
> >>>>within the
> >>>> VM
> >>>> > >
> >>>> > >e.g. A VM can be configured so all binds accept an engine
> >>>> argument
> >>>> > (in
> >>>> > >the case of the i915 likely this is a gem context
> >>>>handle) and
> >>>> binds
> >>>> > >ordered with respect to that engine.
> >>>> > >
> >>>> > >This gives UMDs options as the later likely consumes
> >>>>more KMD
> >>>> > resources
> >>>> > >so if a different UMD can live with binds being
> >>>>ordered within
> >>>> the VM
> >>>> > >they can use a mode consuming less resources.
> >>>> > >
> >>>> >
> >>>> > I think we need to be careful here if we are looking for
> some
> >>>> out of
> >>>> > (submission) order completion of vm_bind/unbind.
> >>>> > In-order completion means, in a batch of binds and
> >>>>unbinds to be
> >>>> > completed in-order, user only needs to specify
> >>>>in-fence for the
> >>>> > first bind/unbind call and the our-fence for the last
> >>>> bind/unbind
> >>>> > call. Also, the VA released by an unbind call can be
> >>>>re-used by
> >>>> > any subsequent bind call in that in-order batch.
> >>>> >
> >>>> > These things will break if binding/unbinding were to
> >>>>be allowed
> >>>> to
> >>>> > go out of order (of submission) and user need to be extra
> >>>> careful
> >>>> > not to run into pre-mature triggereing of out-fence and bind
> >>>> failing
> >>>> > as VA is still in use etc.
> >>>> >
> >>>> > Also, VM_BIND binds the provided mapping on the specified
> >>>> address
> >>>> > space
> >>>> > (VM). So, the uapi is not engine/context specific.
> >>>> >
> >>>> > We can however add a 'queue' to the uapi which can be
> >>>>one from
> >>>> the
> >>>> > pre-defined queues,
> >>>> > I915_VM_BIND_QUEUE_0
> >>>> > I915_VM_BIND_QUEUE_1
> >>>> > ...
> >>>> > I915_VM_BIND_QUEUE_(N-1)
> >>>> >
> >>>> > KMD will spawn an async work queue for each queue which will
> >>>> only
> >>>> > bind the mappings on that queue in the order of submission.
> >>>> > User can assign the queue to per engine or anything
> >>>>like that.
> >>>> >
> >>>> > But again here, user need to be careful and not
> >>>>deadlock these
> >>>> > queues with circular dependency of fences.
> >>>> >
> >>>> > I prefer adding this later an as extension based on
> >>>>whether it
> >>>> > is really helping with the implementation.
> >>>> >
> >>>> > I can tell you right now that having everything on a single
> >>>> in-order
> >>>> > queue will not get us the perf we want. What vulkan
> >>>>really wants
> >>>> is one
> >>>> > of two things:
> >>>> > 1. No implicit ordering of VM_BIND ops. They just happen in
> >>>> whatever
> >>>> > their dependencies are resolved and we ensure ordering
> >>>>ourselves
> >>>> by
> >>>> > having a syncobj in the VkQueue.
> >>>> > 2. The ability to create multiple VM_BIND queues. We need at
> >>>> least 2
> >>>> > but I don't see why there needs to be a limit besides
> >>>>the limits
> >>>> the
> >>>> > i915 API already has on the number of engines. Vulkan could
> >>>> expose
> >>>> > multiple sparse binding queues to the client if it's not
> >>>> arbitrarily
> >>>> > limited.
> >>>>
> >>>> Thanks Jason, Lionel.
> >>>>
> >>>> Jason, what are you referring to when you say "limits the i915 API
> >>>> already
> >>>> has on the number of engines"? I am not sure if there is such an
> uapi
> >>>> today.
> >>>>
> >>>> There's a limit of something like 64 total engines today based on the
> >>>> number of bits we can cram into the exec flags in execbuffer2. I
> think
> >>>> someone had an extended version that allowed more but I ripped it out
> >>>> because no one was using it. Of course, execbuffer3 might not
> >>>>have that
> >>>> problem at all.
> >>>>
> >>>
> >>>Thanks Jason.
> >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably
> >>>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE
> >>>and somehow export it to user (I am thinking of embedding it in
> >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
> >>>queues.
> >>
> >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which
> execbuf3
>
Yup! That's exactly the limit I was talking about.
> >>will also have. So, we can simply define in vm_bind/unbind structures,
> >>
> >>#define I915_VM_BIND_MAX_QUEUE 64
> >> __u32 queue;
> >>
> >>I think that will keep things simple.
> >
> >Hmmm? What does execbuf2 limit has to do with how many engines
> >hardware can have? I suggest not to do that.
> >
> >Change with added this:
> >
> > if (set.num_engines > I915_EXEC_RING_MASK + 1)
> > return -EINVAL;
> >
> >To context creation needs to be undone and so let users create engine
> >maps with all hardware engines, and let execbuf3 access them all.
> >
>
> Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to execbuff3 also.
> Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
> make it N+1).
> But, as discussed in other thread of this RFC series, we are planning
> to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be
> any uapi that limits the number of engines (and hence the vm_bind queues
> need to be supported).
>
> If we leave the number of vm_bind queues to be arbitrarily large
> (__u32 queue_idx) then, we need to have a hashmap for queue (a wq,
> work_item and a linked list) lookup from the user specified queue index.
> Other option is to just put some hard limit (say 64 or 65) and use
> an array of queues in VM (each created upon first use). I prefer this.
>
I don't get why a VM_BIND queue is any different from any other queue or
userspace-visible kernel object. But I'll leave those details up to danvet
or whoever else might be reviewing the implementation.
--Jason
>
> Niranjana
>
> >Regards,
> >
> >Tvrtko
> >
> >>
> >>Niranjana
> >>
> >>>
> >>>> I am trying to see how many queues we need and don't want it to be
> >>>> arbitrarily
> >>>> large and unduely blow up memory usage and complexity in i915
> driver.
> >>>>
> >>>> I expect a Vulkan driver to use at most 2 in the vast majority
> >>>>of cases. I
> >>>> could imagine a client wanting to create more than 1 sparse
> >>>>queue in which
> >>>> case, it'll be N+1 but that's unlikely. As far as complexity
> >>>>goes, once
> >>>> you allow two, I don't think the complexity is going up by
> >>>>allowing N. As
> >>>> for memory usage, creating more queues means more memory. That's a
> >>>> trade-off that userspace can make. Again, the expected number
> >>>>here is 1
> >>>> or 2 in the vast majority of cases so I don't think you need to worry.
> >>>
> >>>Ok, will start with n=3 meaning 8 queues.
> >>>That would require us create 8 workqueues.
> >>>We can change 'n' later if required.
> >>>
> >>>Niranjana
> >>>
> >>>>
> >>>> > Why? Because Vulkan has two basic kind of bind
> >>>>operations and we
> >>>> don't
> >>>> > want any dependencies between them:
> >>>> > 1. Immediate. These happen right after BO creation or
> >>>>maybe as
> >>>> part of
> >>>> > vkBindImageMemory() or VkBindBufferMemory(). These
> >>>>don't happen
> >>>> on a
> >>>> > queue and we don't want them serialized with anything. To
> >>>> synchronize
> >>>> > with submit, we'll have a syncobj in the VkDevice which is
> >>>> signaled by
> >>>> > all immediate bind operations and make submits wait on it.
> >>>> > 2. Queued (sparse): These happen on a VkQueue which may be
> the
> >>>> same as
> >>>> > a render/compute queue or may be its own queue. It's up to us
> >>>> what we
> >>>> > want to advertise. From the Vulkan API PoV, this is like any
> >>>> other
> >>>> > queue. Operations on it wait on and signal semaphores. If we
> >>>> have a
> >>>> > VM_BIND engine, we'd provide syncobjs to wait and
> >>>>signal just like
> >>>> we do
> >>>> > in execbuf().
> >>>> > The important thing is that we don't want one type of
> >>>>operation to
> >>>> block
> >>>> > on the other. If immediate binds are blocking on sparse
> binds,
> >>>> it's
> >>>> > going to cause over-synchronization issues.
> >>>> > In terms of the internal implementation, I know that
> >>>>there's going
> >>>> to be
> >>>> > a lock on the VM and that we can't actually do these things in
> >>>> > parallel. That's fine. Once the dma_fences have signaled and
> >>>> we're
> >>>>
> >>>> Thats correct. It is like a single VM_BIND engine with
> >>>>multiple queues
> >>>> feeding to it.
> >>>>
> >>>> Right. As long as the queues themselves are independent and
> >>>>can block on
> >>>> dma_fences without holding up other queues, I think we're fine.
> >>>>
> >>>> > unblocked to do the bind operation, I don't care if
> >>>>there's a bit
> >>>> of
> >>>> > synchronization due to locking. That's expected. What
> >>>>we can't
> >>>> afford
> >>>> > to have is an immediate bind operation suddenly blocking on a
> >>>> sparse
> >>>> > operation which is blocked on a compute job that's going to
> run
> >>>> for
> >>>> > another 5ms.
> >>>>
> >>>> As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
> >>>> VM_BIND
> >>>> on other VMs. I am not sure about usecases here, but just wanted to
> >>>> clarify.
> >>>>
> >>>> Yes, that's what I would expect.
> >>>> --Jason
> >>>>
> >>>> Niranjana
> >>>>
> >>>> > For reference, Windows solves this by allowing arbitrarily
> many
> >>>> paging
> >>>> > queues (what they call a VM_BIND engine/queue). That
> >>>>design works
> >>>> > pretty well and solves the problems in question.
> >>>>Again, we could
> >>>> just
> >>>> > make everything out-of-order and require using syncobjs
> >>>>to order
> >>>> things
> >>>> > as userspace wants. That'd be fine too.
> >>>> > One more note while I'm here: danvet said something on
> >>>>IRC about
> >>>> VM_BIND
> >>>> > queues waiting for syncobjs to materialize. We don't really
> >>>> want/need
> >>>> > this. We already have all the machinery in userspace to
> handle
> >>>> > wait-before-signal and waiting for syncobj fences to
> >>>>materialize
> >>>> and
> >>>> > that machinery is on by default. It would actually
> >>>>take MORE work
> >>>> in
> >>>> > Mesa to turn it off and take advantage of the kernel
> >>>>being able to
> >>>> wait
> >>>> > for syncobjs to materialize. Also, getting that right is
> >>>> ridiculously
> >>>> > hard and I really don't want to get it wrong in kernel
> >>>>space. When we
> >>>> > do memory fences, wait-before-signal will be a thing. We
> don't
> >>>> need to
> >>>> > try and make it a thing for syncobj.
> >>>> > --Jason
> >>>> >
> >>>> > Thanks Jason,
> >>>> >
> >>>> > I missed the bit in the Vulkan spec that we're allowed to have a
> >>>> sparse
> >>>> > queue that does not implement either graphics or compute
> >>>>operations
> >>>> :
> >>>> >
> >>>> > "While some implementations may include
> >>>> VK_QUEUE_SPARSE_BINDING_BIT
> >>>> > support in queue families that also include
> >>>> >
> >>>> > graphics and compute support, other implementations may only
> >>>> expose a
> >>>> > VK_QUEUE_SPARSE_BINDING_BIT-only queue
> >>>> >
> >>>> > family."
> >>>> >
> >>>> > So it can all be all a vm_bind engine that just does bind/unbind
> >>>> > operations.
> >>>> >
> >>>> > But yes we need another engine for the immediate/non-sparse
> >>>> operations.
> >>>> >
> >>>> > -Lionel
> >>>> >
> >>>> > >
> >>>> > Daniel, any thoughts?
> >>>> >
> >>>> > Niranjana
> >>>> >
> >>>> > >Matt
> >>>> > >
> >>>> > >>
> >>>> > >> Sorry I noticed this late.
> >>>> > >>
> >>>> > >>
> >>>> > >> -Lionel
> >>>> > >>
> >>>> > >>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/intel-gfx/attachments/20220608/062c975b/attachment-0001.htm>
More information about the Intel-gfx
mailing list