[Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
Lionel Landwerlin
lionel.g.landwerlin at intel.com
Fri Jun 10 06:53:24 UTC 2022
On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
> On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
>> On 09/06/2022 00:55, Jason Ekstrand wrote:
>>
>> On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura
>> <niranjana.vishwanathapura at intel.com> wrote:
>>
>> On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
>> >
>> >
>> >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
>> >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
>> >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
>> >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura
>> >>>> <niranjana.vishwanathapura at intel.com> wrote:
>> >>>>
>> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
>> >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote:
>> >>>> >
>> >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>> >>>> > <niranjana.vishwanathapura at intel.com> wrote:
>> >>>> >
>> >>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>> >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>> >>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>> >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>> >>>> > >> > +async worker. The binding and unbinding will work like a special GPU engine.
>> >>>> > >> > +The binding and unbinding operations are serialized and will wait on specified
>> >>>> > >> > +input fences before the operation and will signal the output fences upon the
>> >>>> > >> > +completion of the operation. Due to serialization, completion of an operation
>> >>>> > >> > +will also indicate that all previous operations are also complete.
>> >>>> > >>
>> >>>> > >> I guess we should avoid saying "will immediately start
>> >>>> > >> binding/unbinding" if there are fences involved.
>> >>>> > >>
>> >>>> > >> And the fact that it's happening in an async worker seem to
>> >>>> > >> imply it's not immediate.
>> >>>> > >>
>> >>>> >
>> >>>> > Ok, will fix.
>> >>>> > This was added because in earlier design binding was deferred until
>> >>>> > next execbuff.
>> >>>> > But now it is non-deferred (immediate in that sense). But yah, this
>> >>>> > is confusing and will fix it.
>> >>>> >
>> >>>> > >>
>> >>>> > >> I have a question on the behavior of the bind operation when no
>> >>>> > >> input fence is provided. Let's say I do:
>> >>>> > >>
>> >>>> > >> VM_BIND (out_fence=fence1)
>> >>>> > >>
>> >>>> > >> VM_BIND (out_fence=fence2)
>> >>>> > >>
>> >>>> > >> VM_BIND (out_fence=fence3)
>> >>>> > >>
>> >>>> > >> In what order are the fences going to be signaled?
>> >>>> > >>
>> >>>> > >> In the order of VM_BIND ioctls? Or out of order?
>> >>>> > >>
>> >>>> > >> Because you wrote "serialized" I assume it's: in order.
>> >>>> > >>
>> >>>> >
>> >>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and
>> >>>> > unbind will use the same queue and hence are ordered.
>> >>>> >
>> >>>> > >>
>> >>>> > >> One thing I didn't realize is that because we only get one
>> >>>> > >> "VM_BIND" engine, there is a disconnect from the Vulkan
>> >>>> > >> specification.
>> >>>> > >>
>> >>>> > >> In Vulkan VM_BIND operations are serialized but per engine.
>> >>>> > >>
>> >>>> > >> So you could have something like this :
>> >>>> > >>
>> >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>> >>>> > >>
>> >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>> >>>> > >>
>> >>>> > >> fence1 is not signaled
>> >>>> > >>
>> >>>> > >> fence3 is signaled
>> >>>> > >>
>> >>>> > >> So the second VM_BIND will proceed before the first VM_BIND.
>> >>>> > >>
>> >>>> > >> I guess we can deal with that scenario in userspace by doing the
>> >>>> > >> wait ourselves in one thread per engine.
>> >>>> > >>
>> >>>> > >> But then it makes the VM_BIND input fences useless.
>> >>>> > >>
>> >>>> > >> Daniel : what do you think? Should we rework this or just deal
>> >>>> > >> with wait fences in userspace?
>> >>>> > >>
>> >>>> > >
>> >>>> > >My opinion is rework this but make the ordering via an engine
>> >>>> > >param optional.
>> >>>> > >
>> >>>> > >e.g. A VM can be configured so all binds are ordered within the VM
>> >>>> > >
>> >>>> > >e.g. A VM can be configured so all binds accept an engine argument
>> >>>> > >(in the case of the i915 likely this is a gem context handle) and
>> >>>> > >binds ordered with respect to that engine.
>> >>>> > >
>> >>>> > >This gives UMDs options as the latter likely consumes more KMD
>> >>>> > >resources so if a different UMD can live with binds being ordered
>> >>>> > >within the VM they can use a mode consuming less resources.
>> >>>> > >
>> >>>> >
>> >>>> > I think we need to be careful here if we are looking for some out
>> >>>> > of (submission) order completion of vm_bind/unbind.
>> >>>> > In-order completion means, in a batch of binds and unbinds to be
>> >>>> > completed in-order, user only needs to specify in-fence for the
>> >>>> > first bind/unbind call and the out-fence for the last bind/unbind
>> >>>> > call. Also, the VA released by an unbind call can be re-used by
>> >>>> > any subsequent bind call in that in-order batch.
>> >>>> >
>> >>>> > These things will break if binding/unbinding were to be allowed to
>> >>>> > go out of order (of submission) and user needs to be extra careful
>> >>>> > not to run into premature triggering of out-fence and bind failing
>> >>>> > as VA is still in use etc.
>> >>>> >
>> >>>> > Also, VM_BIND binds the provided mapping on the specified address
>> >>>> > space (VM). So, the uapi is not engine/context specific.
>> >>>> >
>> >>>> > We can however add a 'queue' to the uapi which can be one from the
>> >>>> > pre-defined queues,
>> >>>> > I915_VM_BIND_QUEUE_0
>> >>>> > I915_VM_BIND_QUEUE_1
>> >>>> > ...
>> >>>> > I915_VM_BIND_QUEUE_(N-1)
>> >>>> >
>> >>>> > KMD will spawn an async work queue for each queue which will only
>> >>>> > bind the mappings on that queue in the order of submission.
>> >>>> > User can assign the queue per engine or anything like that.
>> >>>> >
>> >>>> > But again here, user needs to be careful and not deadlock these
>> >>>> > queues with circular dependency of fences.
>> >>>> >
>> >>>> > I prefer adding this later as an extension based on whether it
>> >>>> > is really helping with the implementation.
>> >>>> >
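For illustration, here is a rough sketch of the in-order batching pattern described above, written against hypothetical vm_bind()/vm_unbind() wrappers (the wrapper names and signatures are assumptions, not existing API): only the first operation of the batch carries an in-fence and only the last one carries an out-fence.

    #include <stdint.h>

    /* Hypothetical wrappers around the proposed vm_bind/unbind ioctls. */
    int vm_bind(int fd, uint32_t vm_id, uint32_t queue, uint32_t bo,
                uint64_t va, uint64_t len, int in_fence, int *out_fence);
    int vm_unbind(int fd, uint32_t vm_id, uint32_t queue,
                  uint64_t va, uint64_t len, int in_fence, int *out_fence);

    /*
     * Rebind a range on one queue: ops on the same queue complete in
     * submission order, so the VA freed by the unbind can be reused by the
     * following bind, and a single out-fence covers the whole batch.
     */
    void rebind_range(int fd, uint32_t vm_id, uint32_t queue,
                      uint32_t new_bo, uint64_t va, uint64_t len,
                      int in_fence, int *out_fence)
    {
            /* First op waits on the batch's only in-fence. */
            vm_unbind(fd, vm_id, queue, va, len, in_fence, NULL);
            /* Implicitly ordered after the unbind on the same queue; its
             * out-fence signals completion of the whole batch. */
            vm_bind(fd, vm_id, queue, new_bo, va, len, -1, out_fence);
    }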
>> >>>> > I can tell you right now that having everything on a single
>> >>>> > in-order queue will not get us the perf we want. What Vulkan really
>> >>>> > wants is one of two things:
>> >>>> > 1. No implicit ordering of VM_BIND ops. They just happen in whatever
>> >>>> > order their dependencies are resolved and we ensure ordering
>> >>>> > ourselves by having a syncobj in the VkQueue.
>> >>>> > 2. The ability to create multiple VM_BIND queues. We need at least 2
>> >>>> > but I don't see why there needs to be a limit besides the limits the
>> >>>> > i915 API already has on the number of engines. Vulkan could expose
>> >>>> > multiple sparse binding queues to the client if it's not arbitrarily
>> >>>> > limited.
>> >>>>
>> >>>> Thanks Jason, Lionel.
>> >>>>
>> >>>> Jason, what are you referring to when you say "limits the i915 API
>> >>>> already has on the number of engines"? I am not sure if there is such
>> >>>> an uapi today.
>> >>>>
>> >>>> There's a limit of something like 64 total engines today based on the
>> >>>> number of bits we can cram into the exec flags in execbuffer2. I think
>> >>>> someone had an extended version that allowed more but I ripped it out
>> >>>> because no one was using it. Of course, execbuffer3 might not have
>> >>>> that problem at all.
>> >>>>
>> >>>
>> >>>Thanks Jason.
>> >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3
>> >>>probably will not have this limitation. So, we need to define a
>> >>>VM_BIND_MAX_QUEUE and somehow export it to user (I am thinking of
>> >>>embedding it in I915_PARAM_HAS_VM_BIND: bits[0]->HAS_VM_BIND,
>> >>>bits[1-3]->'n', meaning 2^n queues).
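As an aside, a small sketch of how userspace might decode the encoding proposed above. The parameter name and bit layout are still just a proposal in this thread, not merged uapi:

    #include <stdbool.h>
    #include <stdint.h>

    struct vm_bind_caps {
            bool has_vm_bind;
            unsigned int num_queues;
    };

    /* Proposed encoding: bit 0 = VM_BIND supported, bits [1-3] = n, 2^n queues. */
    static struct vm_bind_caps decode_has_vm_bind(uint64_t value)
    {
            struct vm_bind_caps caps = {
                    .has_vm_bind = value & 0x1,
                    .num_queues  = 1u << ((value >> 1) & 0x7),
            };

            if (!caps.has_vm_bind)
                    caps.num_queues = 0;

            return caps;
    }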
>> >>
>> >>Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f) which execbuf3
>>
>> Yup! That's exactly the limit I was talking about.
>>
>> >>will also have. So, we can simply define in vm_bind/unbind structures,
>> >>
>> >>#define I915_VM_BIND_MAX_QUEUE 64
>> >>    __u32 queue;
>> >>
>> >>I think that will keep things simple.
>> >
>> >Hmmm? What does the execbuf2 limit have to do with how many engines
>> >hardware can have? I suggest not to do that.
>> >
>> >Change which added this:
>> >
>> >	if (set.num_engines > I915_EXEC_RING_MASK + 1)
>> >		return -EINVAL;
>> >
>> >To context creation needs to be undone and so let users create engine
>> >maps with all hardware engines, and let execbuf3 access them all.
>> >
>>
>> Earlier plan was to carry I915_EXEC_RING_MASK (0x3f) to execbuff3 also.
>> Hence, I was using the same limit for VM_BIND queues (64, or 65 if we
>> make it N+1).
>> But, as discussed in other thread of this RFC series, we are planning
>> to drop this I915_EXEC_RING_MASK in execbuff3. So, there won't be any
>> uapi that limits the number of engines (and hence the vm_bind queues
>> need to be supported).
>>
>> If we leave the number of vm_bind queues to be arbitrarily large
>> (__u32 queue_idx) then we need to have a hashmap for queue (a wq,
>> work_item and a linked list) lookup from the user specified queue
>> index. The other option is to just put some hard limit (say 64 or 65)
>> and use an array of queues in the VM (each created upon first use).
>> I prefer this.
>>
>> I don't get why a VM_BIND queue is any different from any other queue
>> or userspace-visible kernel object. But I'll leave those details up to
>> danvet or whoever else might be reviewing the implementation.
>> --Jason
>>
>> I kind of agree here. Wouldn't it be simpler to have the bind queue
>> created like the others when we build the engine map?
>>
>> For userspace it's then just a matter of selecting the right queue ID
>> when submitting.
>>
>> If there is ever a possibility to have this work on the GPU, it would
>> be all ready.
>>
>
> I did sync offline with Matt Brost on this.
> We can add a VM_BIND engine class and let user create VM_BIND engines
> (queues).
> The problem is, in i915 the engine creation interface is bound to
> gem_context.
> So, in the vm_bind ioctl, we would need both context_id and queue_idx
> for proper lookup of the user created engine. This is a bit awkward as
> vm_bind is an interface to the VM (address space) and has nothing to do
> with gem_context.
A gem_context has a single vm object, right?
Set through I915_CONTEXT_PARAM_VM at creation, or given a default one if not.
So it's just a matter of picking up the vm like it's done at execbuffer
time right now: eb->context->vm
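For reference, a minimal sketch of the existing uapi path referred to here, where a context gets its single VM at creation time (header locations vary between kernel and libdrm installs; error handling omitted, treat it as illustrative):

    #include <stdint.h>
    #include <xf86drm.h>
    #include <i915_drm.h>

    static uint32_t create_context_with_vm(int fd, uint32_t *vm_id)
    {
            struct drm_i915_gem_vm_control vm = {};
            struct drm_i915_gem_context_create_ext_setparam set_vm = {};
            struct drm_i915_gem_context_create_ext create = {};

            /* Create a new address space (VM). */
            drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm);
            *vm_id = vm.vm_id;

            /* Attach that VM to the context at creation time. */
            set_vm.base.name = I915_CONTEXT_CREATE_EXT_SETPARAM;
            set_vm.param.param = I915_CONTEXT_PARAM_VM;
            set_vm.param.value = vm.vm_id;

            create.flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS;
            create.extensions = (uintptr_t)&set_vm;
            drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);

            /* Execbuf on this context then picks up eb->context->vm. */
            return create.ctx_id;
    }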
> Another problem is, if two VMs are binding with the same defined engine,
> binding on VM1 can get unnecessarily blocked by binding on VM2 (which
> may be waiting on its in_fence).
Maybe I'm missing something, but how can you have 2 vm objects with a
single gem_context right now?
>
> So, my preference here is to just add a 'u32 queue' index in the
> vm_bind/unbind ioctl, and the queues are per VM.
>
> Niranjana
>
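To make the shape of that concrete, a purely hypothetical sketch of a vm_bind ioctl argument with a per-VM queue index. None of these field or macro names are merged uapi; they only mirror the proposal being discussed in this thread:

    #include <linux/types.h>

    #define I915_VM_BIND_MAX_QUEUE  64      /* or 2^n, advertised via getparam */

    struct drm_i915_gem_vm_bind {
            __u32 vm_id;            /* address space (VM) to bind into */
            __u32 queue_idx;        /* per-VM bind queue, [0, MAX_QUEUE) */
            __u32 handle;           /* GEM object to bind */
            __u32 pad;
            __u64 start;            /* GPU virtual address of the mapping */
            __u64 offset;           /* offset into the object */
            __u64 length;           /* length of the mapping */
            __u64 flags;
            __u64 extensions;       /* e.g. chained in/out fence extensions */
    };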
>> Thanks,
>>
>> -Lionel
>>
>>
>> Niranjana
>>
>> >Regards,
>> >
>> >Tvrtko
>> >
>> >>
>> >>Niranjana
>> >>
>> >>>
>> >>>> I am trying to see how many queues we need and don't want it to be
>> >>>> arbitrarily large and unduly blow up memory usage and complexity in
>> >>>> the i915 driver.
>> >>>>
>> >>>> I expect a Vulkan driver to use at most 2 in the vast majority of
>> >>>> cases. I could imagine a client wanting to create more than 1 sparse
>> >>>> queue in which case, it'll be N+1 but that's unlikely. As far as
>> >>>> complexity goes, once you allow two, I don't think the complexity is
>> >>>> going up by allowing N. As for memory usage, creating more queues
>> >>>> means more memory. That's a trade-off that userspace can make.
>> >>>> Again, the expected number here is 1 or 2 in the vast majority of
>> >>>> cases so I don't think you need to worry.
>> >>>
>> >>>Ok, will start with n=3 meaning 8 queues.
>> >>>That would require us to create 8 workqueues.
>> >>>We can change 'n' later if required.
>> >>>
>> >>>Niranjana
>> >>>
>> >>>>
>> >>>> > Why? Because Vulkan has two basic kinds of bind operations and we
>> >>>> > don't want any dependencies between them:
>> >>>> > 1. Immediate. These happen right after BO creation or maybe as part
>> >>>> > of vkBindImageMemory() or vkBindBufferMemory(). These don't happen
>> >>>> > on a queue and we don't want them serialized with anything. To
>> >>>> > synchronize with submit, we'll have a syncobj in the VkDevice which
>> >>>> > is signaled by all immediate bind operations and make submits wait
>> >>>> > on it.
>> >>>> > 2. Queued (sparse): These happen on a VkQueue which may be the same
>> >>>> > as a render/compute queue or may be its own queue. It's up to us
>> >>>> > what we want to advertise. From the Vulkan API PoV, this is like any
>> >>>> > other queue. Operations on it wait on and signal semaphores. If we
>> >>>> > have a VM_BIND engine, we'd provide syncobjs to wait and signal just
>> >>>> > like we do in execbuf().
>> >>>> > The important thing is that we don't want one type of operation to
>> >>>> > block on the other. If immediate binds are blocking on sparse binds,
>> >>>> > it's going to cause over-synchronization issues.
>> >>>> > In terms of the internal implementation, I know that there's going
>> >>>> > to be a lock on the VM and that we can't actually do these things in
>> >>>> > parallel. That's fine. Once the dma_fences have signaled and we're
>> >>>>
>> >>>> That's correct. It is like a single VM_BIND engine with multiple
>> >>>> queues feeding to it.
>> >>>>
>> >>>> Right. As long as the queues themselves are independent and can block
>> >>>> on dma_fences without holding up other queues, I think we're fine.
>> >>>>
>> >>>> > unblocked to do the bind operation, I don't care if there's a bit of
>> >>>> > synchronization due to locking. That's expected. What we can't
>> >>>> > afford to have is an immediate bind operation suddenly blocking on a
>> >>>> > sparse operation which is blocked on a compute job that's going to
>> >>>> > run for another 5ms.
>> >>>>
>> >>>> As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the
>> >>>> VM_BIND on other VMs. I am not sure about usecases here, but just
>> >>>> wanted to clarify.
>> >>>>
>> >>>> Yes, that's what I would expect.
>> >>>> --Jason
>> >>>>
>> >>>> Niranjana
>> >>>>
>> >>>> > For reference, Windows solves this by allowing arbitrarily many
>> >>>> > paging queues (what they call a VM_BIND engine/queue). That design
>> >>>> > works pretty well and solves the problems in question. Again, we
>> >>>> > could just make everything out-of-order and require using syncobjs
>> >>>> > to order things as userspace wants. That'd be fine too.
>> >>>> > One more note while I'm here: danvet said something on IRC about
>> >>>> > VM_BIND queues waiting for syncobjs to materialize. We don't really
>> >>>> > want/need this. We already have all the machinery in userspace to
>> >>>> > handle wait-before-signal and waiting for syncobj fences to
>> >>>> > materialize and that machinery is on by default. It would actually
>> >>>> > take MORE work in Mesa to turn it off and take advantage of the
>> >>>> > kernel being able to wait for syncobjs to materialize. Also, getting
>> >>>> > that right is ridiculously hard and I really don't want to get it
>> >>>> > wrong in kernel space. When we do memory fences, wait-before-signal
>> >>>> > will be a thing. We don't need to try and make it a thing for
>> >>>> > syncobj.
>> >>>> > --Jason
>> >>>> >
>> >>>> > Thanks Jason,
>> >>>> >
>> >>>> > I missed the bit in the Vulkan spec that we're allowed to have a
>> >>>> > sparse queue that does not implement either graphics or compute
>> >>>> > operations:
>> >>>> >
>> >>>> > "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
>> >>>> > support in queue families that also include graphics and compute
>> >>>> > support, other implementations may only expose a
>> >>>> > VK_QUEUE_SPARSE_BINDING_BIT-only queue family."
>> >>>> >
>> >>>> > So it can all be a vm_bind engine that just does bind/unbind
>> >>>> > operations.
>> >>>> >
>> >>>> > But yes we need another engine for the immediate/non-sparse
>> >>>> > operations.
>> >>>> >
>> >>>> > -Lionel
>> >>>> >
>> >>>> > >
>> >>>> > Daniel, any thoughts?
>> >>>> >
>> >>>> > Niranjana
>> >>>> >
>> >>>> > >Matt
>> >>>> > >
>> >>>> > >>
>> >>>> > >> Sorry I noticed this late.
>> >>>> > >>
>> >>>> > >>
>> >>>> > >> -Lionel
>> >>>> > >>
>> >>>> > >>