[Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Wed Jun 8 07:33:25 UTC 2022
On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
> On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
>> On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>>> <niranjana.vishwanathapura at intel.com> wrote:
>>>
>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>>> > On 02/06/2022 23:35, Jason Ekstrand wrote:
>>> >
>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>>> > <niranjana.vishwanathapura at intel.com> wrote:
>>> >
>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>> > >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>>> > >> > +async worker. The binding and unbinding will work like a special GPU engine.
>>> > >> > +The binding and unbinding operations are serialized and will wait on specified
>>> > >> > +input fences before the operation and will signal the output fences upon the
>>> > >> > +completion of the operation. Due to serialization, completion of an operation
>>> > >> > +will also indicate that all previous operations are also complete.
>>> > >>
>>> > >> I guess we should avoid saying "will immediately start
>>> > >> binding/unbinding" if there are fences involved.
>>> > >>
>>> > >> And the fact that it's happening in an async worker seems to imply
>>> > >> it's not immediate.
>>> > >>
>>> >
>>> > Ok, will fix.
>>> > This was added because in the earlier design binding was deferred until
>>> > the next execbuf. But now it is non-deferred (immediate in that sense).
>>> > But yah, this is confusing and I will fix it.
>>> >
>>> > >>
>>> > >> I have a question on the behavior of the bind operation when no input
>>> > >> fence is provided. Let's say I do:
>>> > >>
>>> > >> VM_BIND (out_fence=fence1)
>>> > >>
>>> > >> VM_BIND (out_fence=fence2)
>>> > >>
>>> > >> VM_BIND (out_fence=fence3)
>>> > >>
>>> > >>
>>> > >> In what order are the fences going to be signaled?
>>> > >>
>>> > >> In the order of VM_BIND ioctls? Or out of order?
>>> > >>
>>> > >> Because you wrote "serialized", I assume it's: in order.
>>> > >>
>>> >
>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind
>>> > will use the same queue and hence are ordered.
>>> >
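A rough illustrative sketch of what that in-order guarantee buys userspace;
the vm_bind() and wait_fence() wrappers and their arguments below are
placeholders for the proposed ioctl and its out-fence, not existing API:

#include <stdint.h>

/* Placeholder prototypes for the proposed uapi wrappers. */
int vm_bind(int vm_fd, uint32_t handle, uint64_t va, uint64_t size,
            int in_fence, int *out_fence);
void wait_fence(int fence);

void bind_three(int vm_fd, uint32_t bo1, uint32_t bo2, uint32_t bo3)
{
	int fence1, fence2, fence3;

	vm_bind(vm_fd, bo1, 0x100000, 0x10000, -1, &fence1);
	vm_bind(vm_fd, bo2, 0x200000, 0x10000, -1, &fence2);
	vm_bind(vm_fd, bo3, 0x300000, 0x10000, -1, &fence3);

	/*
	 * Binds submitted to the same queue complete in submission order,
	 * so waiting on fence3 also guarantees fence1 and fence2 have
	 * signaled.
	 */
	wait_fence(fence3);
}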
>>> > >>
>>> > >> One thing I didn't realize is that because we only get one "VM_BIND"
>>> > >> engine, there is a disconnect from the Vulkan specification.
>>> > >>
>>> > >> In Vulkan, VM_BIND operations are serialized but per engine.
>>> > >>
>>> > >> So you could have something like this:
>>> > >>
>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>> > >>
>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>> > >>
>>> > >>
>>> > >> fence1 is not signaled
>>> > >>
>>> > >> fence3 is signaled
>>> > >>
>>> > >> So the second VM_BIND will proceed before the first VM_BIND.
>>> > >>
>>> > >>
>>> > >> I guess we can deal with that scenario in userspace by doing the wait
>>> > >> ourselves in one thread per engine.
>>> > >>
>>> > >> But then it makes the VM_BIND input fences useless.
>>> > >>
>>> > >>
>>> > >> Daniel: what do you think? Should we rework this or just deal with
>>> > >> wait fences in userspace?
>>> > >>
>>> > >
>>> > >My opinion is rework this but make the ordering via an engine param
>>> > >optional.
>>> > >
>>> > >e.g. A VM can be configured so all binds are ordered within the VM.
>>> > >
>>> > >e.g. A VM can be configured so all binds accept an engine argument (in
>>> > >the case of the i915 this is likely a gem context handle) and binds are
>>> > >ordered with respect to that engine.
>>> > >
>>> > >This gives UMDs options, as the latter likely consumes more KMD
>>> > >resources, so if a different UMD can live with binds being ordered
>>> > >within the VM they can use a mode consuming fewer resources.
>>> > >
>>> >
>>> > I think we need to be careful here if we are looking for some out of
>>> > (submission) order completion of vm_bind/unbind.
>>> > In-order completion means that, in a batch of binds and unbinds to be
>>> > completed in-order, the user only needs to specify the in-fence for the
>>> > first bind/unbind call and the out-fence for the last bind/unbind call.
>>> > Also, the VA released by an unbind call can be re-used by any subsequent
>>> > bind call in that in-order batch.
>>> >
>>> > These things will break if binding/unbinding were allowed to go out of
>>> > order (of submission), and the user would need to be extra careful not
>>> > to run into premature triggering of the out-fence, binds failing because
>>> > the VA is still in use, etc.
>>> >
>>> > Also, VM_BIND binds the provided mapping on the specified address space
>>> > (VM). So, the uapi is not engine/context specific.
>>> >
>>> > We can however add a 'queue' to the uapi, which can be one of the
>>> > pre-defined queues:
>>> > I915_VM_BIND_QUEUE_0
>>> > I915_VM_BIND_QUEUE_1
>>> > ...
>>> > I915_VM_BIND_QUEUE_(N-1)
>>> >
>>> > The KMD will spawn an async work queue for each queue, which will only
>>> > bind the mappings on that queue in the order of submission.
>>> > The user can assign a queue per engine or anything like that.
>>> >
>>> > But again here, the user needs to be careful not to deadlock these
>>> > queues with a circular dependency of fences.
>>> >
>>> > I prefer adding this later as an extension, based on whether it is
>>> > really helping with the implementation.
>>> >
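For illustration, a hypothetical sketch of how such a queue selector could
sit in the bind ioctl argument; the struct name, field names and layout are
guesses for discussion only, not the proposed uapi:

#include <linux/types.h>

/*
 * Hypothetical shape of a bind ioctl argument with a queue selector.
 * Binds with the same 'queue' value complete in submission order, while
 * different queues are independent of each other.
 */
struct vm_bind_sketch {
	__u32 vm_id;		/* VM to bind into */
	__u32 handle;		/* GEM object handle */
	__u64 start;		/* GPU virtual address */
	__u64 offset;		/* offset into the object */
	__u64 length;		/* length of the mapping */
#define I915_VM_BIND_QUEUE_0	0
#define I915_VM_BIND_QUEUE_1	1
	__u32 queue;		/* which in-order bind queue to use */
	__u32 pad;
	__u64 flags;
	__u64 extensions;	/* e.g. in/out fence extensions */
};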
>>> > I can tell you right now that having everything on a single in-order
>>> > queue will not get us the perf we want. What Vulkan really wants is one
>>> > of two things:
>>> > 1. No implicit ordering of VM_BIND ops. They just happen in whatever
>>> > order their dependencies are resolved, and we ensure ordering ourselves
>>> > by having a syncobj in the VkQueue.
>>> > 2. The ability to create multiple VM_BIND queues. We need at least 2,
>>> > but I don't see why there needs to be a limit besides the limits the
>>> > i915 API already has on the number of engines. Vulkan could expose
>>> > multiple sparse binding queues to the client if it's not arbitrarily
>>> > limited.
>>>
>>> Thanks Jason, Lionel.
>>>
>>> Jason, what are you referring to when you say "limits the i915 API already
>>> has on the number of engines"? I am not sure if there is such an uapi today.
>>>
>>> There's a limit of something like 64 total engines today based on the
>>> number of bits we can cram into the exec flags in execbuffer2. I think
>>> someone had an extended version that allowed more but I ripped it out
>>> because no one was using it. Of course, execbuffer3 might not have that
>>> problem at all.
>>>
>>
>> Thanks Jason.
>> Ok, I am not sure which exec flag that is, but yah, execbuffer3 probably
>> will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE
>> and somehow export it to the user (I am thinking of embedding it in
>> I915_PARAM_HAS_VM_BIND: bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n
>> queues).
>
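For illustration only, userspace could then discover VM_BIND support and the
queue count roughly as below; I915_PARAM_HAS_VM_BIND and its bit layout are
still just the proposal above, so the define is a placeholder:

#include <sys/ioctl.h>
#include <drm/i915_drm.h>

#ifndef I915_PARAM_HAS_VM_BIND
#define I915_PARAM_HAS_VM_BIND 0	/* placeholder, real value TBD */
#endif

/* Returns 0 if VM_BIND is not supported, 1 if it is (filling *num_queues),
 * -1 on ioctl error. */
static int query_vm_bind(int fd, unsigned int *num_queues)
{
	int value = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_VM_BIND,
		.value = &value,
	};

	if (ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp))
		return -1;

	if (!(value & 1))
		return 0;

	*num_queues = 1u << ((value >> 1) & 0x7);	/* bits[1-3] = n */
	return 1;
}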
> Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f), which execbuf3
> will also have. So, we can simply define in the vm_bind/unbind structures:
>
> #define I915_VM_BIND_MAX_QUEUE 64
> __u32 queue;
>
> I think that will keep things simple.
Hmmm? What does the execbuf2 limit have to do with how many engines the
hardware can have? I suggest not doing that.

The change which added this check to context creation:

	if (set.num_engines > I915_EXEC_RING_MASK + 1)
		return -EINVAL;

needs to be undone, so that users can create engine maps with all hardware
engines and execbuf3 can access them all.
Regards,
Tvrtko
>
> Niranjana
>
>>
>>> I am trying to see how many queues we need and don't want it to be arbitrarily
>>> large and unduly blow up memory usage and complexity in the i915 driver.
>>>
>>> I expect a Vulkan driver to use at most 2 in the vast majority of cases. I
>>> could imagine a client wanting to create more than 1 sparse queue, in which
>>> case it'll be N+1, but that's unlikely. As far as complexity goes, once
>>> you allow two, I don't think the complexity is going up by allowing N. As
>>> for memory usage, creating more queues means more memory. That's a
>>> trade-off that userspace can make. Again, the expected number here is 1
>>> or 2 in the vast majority of cases, so I don't think you need to worry.
>>
>> Ok, will start with n=3 meaning 8 queues.
>> That would require us to create 8 workqueues.
>> We can change 'n' later if required.
>>
>> Niranjana
>>
>>>
>>> > Why? Because Vulkan has two basic kinds of bind operations and we don't
>>> > want any dependencies between them:
>>> > 1. Immediate. These happen right after BO creation or maybe as part of
>>> > vkBindImageMemory() or vkBindBufferMemory(). These don't happen on a
>>> > queue and we don't want them serialized with anything. To synchronize
>>> > with submit, we'll have a syncobj in the VkDevice which is signaled by
>>> > all immediate bind operations and make submits wait on it.
>>> > 2. Queued (sparse): These happen on a VkQueue which may be the same as
>>> > a render/compute queue or may be its own queue. It's up to us what we
>>> > want to advertise. From the Vulkan API PoV, this is like any other
>>> > queue. Operations on it wait on and signal semaphores. If we have a
>>> > VM_BIND engine, we'd provide syncobjs to wait and signal just like we
>>> > do in execbuf().
>>> > The important thing is that we don't want one type of operation to
>>> > block on the other. If immediate binds are blocking on sparse binds, it's
>>> > going to cause over-synchronization issues.
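A rough userspace-side sketch of that immediate-bind scheme, assuming the
bind out-fence can signal a timeline syncobj point; vm_bind_immediate() is a
placeholder for the proposed ioctl, and the CPU-side wait below stands in for
what a real driver would pass to execbuf as a wait fence:

#include <stdint.h>
#include <xf86drm.h>

struct bind_sync {
	int      drm_fd;
	uint32_t bind_syncobj;	/* device-level timeline syncobj */
	uint64_t bind_point;	/* last timeline point handed to a bind */
};

/* Placeholder: issue the bind and have its completion signal
 * (syncobj, point). Not an existing ioctl. */
int vm_bind_immediate(int fd, uint32_t handle, uint64_t va, uint64_t size,
		      uint32_t syncobj, uint64_t point);

static void immediate_bind(struct bind_sync *s, uint32_t bo,
			   uint64_t va, uint64_t size)
{
	/* Not serialized against any sparse binding queue. */
	vm_bind_immediate(s->drm_fd, bo, va, size,
			  s->bind_syncobj, ++s->bind_point);
}

static void submit(struct bind_sync *s /* , batch ... */)
{
	/* Every submit waits for all immediate binds issued so far. */
	drmSyncobjTimelineWait(s->drm_fd, &s->bind_syncobj, &s->bind_point, 1,
			       INT64_MAX,
			       DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT, NULL);
	/* ... then do the execbuf with its own fences ... */
}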
>>> > In terms of the internal implementation, I know that there's going to
>>> > be a lock on the VM and that we can't actually do these things in
>>> > parallel. That's fine. Once the dma_fences have signaled and we're
>>>
>>> That's correct. It is like a single VM_BIND engine with multiple queues
>>> feeding into it.
>>>
>>> Right. As long as the queues themselves are independent and can block on
>>> dma_fences without holding up other queues, I think we're fine.
>>>
>>> > unblocked to do the bind operation, I don't care if there's a bit of
>>> > synchronization due to locking. That's expected. What we can't afford
>>> > to have is an immediate bind operation suddenly blocking on a sparse
>>> > operation which is blocked on a compute job that's going to run for
>>> > another 5ms.
>>>
>>> As the VM_BIND queue is per VM, a VM_BIND on one VM doesn't block the
>>> VM_BIND on other VMs. I am not sure about the usecases here, but just
>>> wanted to clarify.
>>>
>>> Yes, that's what I would expect.
>>> --Jason
>>>
>>> Niranjana
>>>
>>> > For reference, Windows solves this by allowing arbitrarily many paging
>>> > queues (what they call a VM_BIND engine/queue). That design works
>>> > pretty well and solves the problems in question. Again, we could just
>>> > make everything out-of-order and require using syncobjs to order things
>>> > as userspace wants. That'd be fine too.
>>> > One more note while I'm here: danvet said something on IRC about VM_BIND
>>> > queues waiting for syncobjs to materialize. We don't really want/need
>>> > this. We already have all the machinery in userspace to handle
>>> > wait-before-signal and waiting for syncobj fences to materialize, and
>>> > that machinery is on by default. It would actually take MORE work in
>>> > Mesa to turn it off and take advantage of the kernel being able to wait
>>> > for syncobjs to materialize. Also, getting that right is ridiculously
>>> > hard and I really don't want to get it wrong in kernel space. When we
>>> > do memory fences, wait-before-signal will be a thing. We don't need to
>>> > try and make it a thing for syncobj.
>>> > --Jason
>>> >
>>> > Thanks Jason,
>>> >
>>> > I missed the bit in the Vulkan spec that we're allowed to have a sparse
>>> > queue that does not implement either graphics or compute operations:
>>> >
>>> > "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
>>> > support in queue families that also include graphics and compute
>>> > support, other implementations may only expose a
>>> > VK_QUEUE_SPARSE_BINDING_BIT-only queue family."
>>> >
>>> > So it can all be a vm_bind engine that just does bind/unbind
>>> > operations.
>>> >
>>> > But yes, we need another engine for the immediate/non-sparse
>>> > operations.
>>> >
>>> > -Lionel
>>> >
>>> > >
>>> > Daniel, any thoughts?
>>> >
>>> > Niranjana
>>> >
>>> > >Matt
>>> > >
>>> > >>
>>> > >> Sorry I noticed this late.
>>> > >>
>>> > >>
>>> > >> -Lionel
>>> > >>
>>> > >>