<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 09/06/2022 00:55, Jason Ekstrand
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAOFGe97UDd2S+LdKeOWubFvc4cNy6KbRTtCPKUbwd8PnZPuvMQ@mail.gmail.com">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Jun 8, 2022 at 4:44
PM Niranjana Vishwanathapura <<a
href="mailto:niranjana.vishwanathapura@intel.com"
moz-do-not-send="true" class="moz-txt-link-freetext">niranjana.vishwanathapura@intel.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">On Wed, Jun 08, 2022 at
08:33:25AM +0100, Tvrtko Ursulin wrote:<br>
><br>
><br>
>On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:<br>
>>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana
Vishwanathapura wrote:<br>
>>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason
Ekstrand wrote:<br>
>>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana
Vishwanathapura<br>
>>>> <<a
href="mailto:niranjana.vishwanathapura@intel.com"
target="_blank" moz-do-not-send="true"
class="moz-txt-link-freetext">niranjana.vishwanathapura@intel.com</a>>
wrote:<br>
>>>><br>
>>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300,
Lionel Landwerlin wrote:<br>
>>>> > On 02/06/2022 23:35, Jason
Ekstrand wrote:<br>
>>>> ><br>
>>>> > On Thu, Jun 2, 2022 at 3:11 PM
Niranjana Vishwanathapura<br>
>>>> > <<a
href="mailto:niranjana.vishwanathapura@intel.com"
target="_blank" moz-do-not-send="true"
class="moz-txt-link-freetext">niranjana.vishwanathapura@intel.com</a>>
wrote:<br>
>>>> ><br>
>>>> > On Wed, Jun 01, 2022 at
01:28:36PM -0700, Matthew <br>
>>>>Brost wrote:<br>
>>>> > >On Wed, Jun 01, 2022 at
05:25:49PM +0300, Lionel Landwerlin<br>
>>>> wrote:<br>
>>>> > >> On 17/05/2022 21:32,
Niranjana Vishwanathapura wrote:<br>
>>>> > >> > +VM_BIND/UNBIND
ioctl will immediately start<br>
>>>> binding/unbinding<br>
>>>> > the mapping in an<br>
>>>> > >> > +async worker.
The binding and unbinding will <br>
>>>>work like a<br>
>>>> special<br>
>>>> > GPU engine.<br>
>>>> > >> > +The binding and
unbinding operations are serialized and<br>
>>>> will<br>
>>>> > wait on specified<br>
>>>> > >> > +input fences
before the operation and will signal the<br>
>>>> output<br>
>>>> > fences upon the<br>
>>>> > >> > +completion of
the operation. Due to serialization,<br>
>>>> completion of<br>
>>>> > an operation<br>
>>>> > >> > +will also
indicate that all previous operations <br>
>>>>are also<br>
>>>> > complete.<br>
>>>> > >><br>
>>>> > >> I guess we should
avoid saying "will immediately start<br>
>>>> > binding/unbinding" if<br>
>>>> > >> there are fences
involved.<br>
>>>> > >><br>
>>>> > >> And the fact that
it's happening in an async <br>
>>>>worker seems to<br>
>>>> imply<br>
>>>> > it's not<br>
>>>> > >> immediate.<br>
>>>> > >><br>
>>>> ><br>
>>>> > Ok, will fix.<br>
>>>> > This was added because in
the earlier design binding was deferred<br>
>>>> until<br>
>>>> > the next execbuf.<br>
>>>> > But now it is non-deferred
(immediate in that sense). <br>
>>>>But yah,<br>
>>>> this is<br>
>>>> > confusing<br>
>>>> > and I will fix it.<br>
>>>> ><br>
>>>> > >><br>
>>>> > >> I have a question on
the behavior of the bind <br>
>>>>operation when<br>
>>>> no<br>
>>>> > input fence<br>
>>>> > >> is provided. Let say
I do :<br>
>>>> > >><br>
>>>> > >> VM_BIND
(out_fence=fence1)<br>
>>>> > >><br>
>>>> > >> VM_BIND
(out_fence=fence2)<br>
>>>> > >><br>
>>>> > >> VM_BIND
(out_fence=fence3)<br>
>>>> > >><br>
>>>> > >><br>
>>>> > >> In what order are the
fences going to be signaled?<br>
>>>> > >><br>
>>>> > >> In the order of
VM_BIND ioctls? Or out of order?<br>
>>>> > >><br>
>>>> > >> Because you wrote
"serialized I assume it's : in order<br>
>>>> > >><br>
>>>> ><br>
>>>> > Yes, in the order of
VM_BIND/UNBIND ioctls. Note that <br>
>>>>bind and<br>
>>>> unbind<br>
>>>> > will use<br>
>>>> > the same queue and hence are
ordered.<br>
>>>> ><br>
>>>> > >><br>
>>>> > >> One thing I didn't
realize is that because we only get one<br>
>>>> > "VM_BIND" engine,<br>
>>>> > >> there is a disconnect
from the Vulkan specification.<br>
>>>> > >><br>
>>>> > >> In Vulkan VM_BIND
operations are serialized but <br>
>>>>per engine.<br>
>>>> > >><br>
>>>> > >> So you could have
something like this :<br>
>>>> > >><br>
>>>> > >> VM_BIND (engine=rcs0,
in_fence=fence1, out_fence=fence2)<br>
>>>> > >><br>
>>>> > >> VM_BIND (engine=ccs0,
in_fence=fence3, out_fence=fence4)<br>
>>>> > >><br>
>>>> > >><br>
>>>> > >> fence1 is not
signaled<br>
>>>> > >><br>
>>>> > >> fence3 is signaled<br>
>>>> > >><br>
>>>> > >> So the second VM_BIND
will proceed before the <br>
>>>>first VM_BIND.<br>
>>>> > >><br>
>>>> > >><br>
>>>> > >> I guess we can deal
with that scenario in <br>
>>>>userspace by doing<br>
>>>> the<br>
>>>> > wait<br>
>>>> > >> ourselves in one
thread per engine.<br>
>>>> > >><br>
>>>> > >> But then it makes the
VM_BIND input fences useless.<br>
>>>> > >><br>
>>>> > >><br>
>>>> > >> Daniel: what do you
think? Should we rework this or just<br>
>>>> deal with<br>
>>>> > wait<br>
>>>> > >> fences in userspace?<br>
>>>> > >><br>
>>>> > ><br>
>>>> > >My opinion is rework this
but make the ordering via <br>
>>>>an engine<br>
>>>> param<br>
>>>> > optional.<br>
>>>> > ><br>
>>>> > >e.g. A VM can be
configured so all binds are ordered <br>
>>>>within the<br>
>>>> VM<br>
>>>> > ><br>
>>>> > >e.g. A VM can be
configured so all binds accept an engine<br>
>>>> argument<br>
>>>> > (in<br>
>>>> > >the case of the i915
likely this is a gem context <br>
>>>>handle) and<br>
>>>> binds<br>
>>>> > >ordered with respect to
that engine.<br>
>>>> > ><br>
>>>> > >This gives UMDs options as
the latter likely consumes <br>
>>>>more KMD<br>
>>>> > resources<br>
>>>> > >so if a different UMD can
live with binds being <br>
>>>>ordered within<br>
>>>> the VM<br>
>>>> > >they can use a mode
consuming less resources.<br>
>>>> > ><br>
>>>> ><br>
>>>> > I think we need to be careful
here if we are looking for some<br>
>>>> out of<br>
>>>> > (submission) order completion
of vm_bind/unbind.<br>
>>>> > In-order completion means that, in
a batch of binds and <br>
>>>>unbinds to be<br>
>>>> > completed in-order, the user only
needs to specify <br>
>>>>in-fence for the<br>
>>>> > first bind/unbind call and the
our-fence for the last<br>
>>>> bind/unbind<br>
>>>> > call. Also, the VA released by
an unbind call can be <br>
>>>>re-used by<br>
>>>> > any subsequent bind call in
that in-order batch.<br>
>>>> ><br>
>>>> > These things will break if
binding/unbinding were to <br>
>>>>be allowed<br>
>>>> to<br>
>>>> > go out of order (of
submission) and the user needs to be extra<br>
>>>> careful<br>
>>>> > not to run into premature
triggering of the out-fence and bind<br>
>>>> failing<br>
>>>> > as VA is still in use etc.<br>
>>>> ><br>
>>>> > Also, VM_BIND binds the
provided mapping on the specified<br>
>>>> address<br>
>>>> > space<br>
>>>> > (VM). So, the uapi is not
engine/context specific.<br>
>>>> ><br>
>>>> > We can however add a 'queue'
to the uapi which can be <br>
>>>>one from<br>
>>>> the<br>
>>>> > pre-defined queues,<br>
>>>> > I915_VM_BIND_QUEUE_0<br>
>>>> > I915_VM_BIND_QUEUE_1<br>
>>>> > ...<br>
>>>> > I915_VM_BIND_QUEUE_(N-1)<br>
>>>> ><br>
>>>> > KMD will spawn an async work
queue for each queue which will<br>
>>>> only<br>
>>>> > bind the mappings on that
queue in the order of submission.<br>
>>>> > The user can assign a queue
per engine or anything <br>
>>>>like that.<br>
>>>> ><br>
>>>> > But again here, the user needs to
be careful not to <br>
>>>>deadlock these<br>
>>>> > queues with circular
dependency of fences.<br>
>>>> ><br>
>>>> > I prefer adding this later as
an extension based on <br>
>>>>whether it<br>
>>>> > is really helping with the
implementation.<br>
>>>> ><br>
>>>> > I can tell you right now that
having everything on a single<br>
>>>> in-order<br>
>>>> > queue will not get us the perf
we want. What Vulkan <br>
>>>>really wants<br>
>>>> is one<br>
>>>> > of two things:<br>
>>>> > 1. No implicit ordering of
VM_BIND ops. They just happen in<br>
>>>> whatever order<br>
>>>> > their dependencies are resolved
and we ensure ordering <br>
>>>>ourselves<br>
>>>> by<br>
>>>> > having a syncobj in the VkQueue.<br>
>>>> > 2. The ability to create
multiple VM_BIND queues. We need at<br>
>>>> least 2<br>
>>>> > but I don't see why there needs
to be a limit besides <br>
>>>>the limits<br>
>>>> the<br>
>>>> > i915 API already has on the
number of engines. Vulkan could<br>
>>>> expose<br>
>>>> > multiple sparse binding queues
to the client if it's not<br>
>>>> arbitrarily<br>
>>>> > limited.<br>
>>>><br>
>>>> Thanks Jason, Lionel.<br>
>>>><br>
>>>> Jason, what are you referring to when you
say "limits the i915 API<br>
>>>> already<br>
>>>> has on the number of engines"? I am not
sure if there is such an uapi<br>
>>>> today.<br>
>>>><br>
>>>> There's a limit of something like 64 total
engines today based on the<br>
>>>> number of bits we can cram into the exec
flags in execbuffer2. I think<br>
>>>> someone had an extended version that
allowed more but I ripped it out<br>
>>>> because no one was using it. Of course,
execbuffer3 might not <br>
>>>>have that<br>
>>>> problem at all.<br>
>>>><br>
>>><br>
>>>Thanks Jason.<br>
>>>Ok, I am not sure which exec flag that is, but
yah, execbuffer3 probably<br>
>>>will not have this limitation. So, we need to
define a VM_BIND_MAX_QUEUE<br>
>>>and somehow export it to user (I am thinking of
embedding it in<br>
>>>I915_PARAM_HAS_VM_BIND: bits[0]->HAS_VM_BIND,
bits[1-3]->'n' meaning 2^n<br>
>>>queues).<br>
>><br>
>>Ah, I think you are talking about I915_EXEC_RING_MASK
(0x3f) which execbuf3<br>
</blockquote>
<div><br>
</div>
<div>Yup! That's exactly the limit I was talking about.<br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
>>will also have. So, we can simply define in
vm_bind/unbind structures,<br>
>><br>
>>#define I915_VM_BIND_MAX_QUEUE 64<br>
>> __u32 queue;<br>
>><br>
>>I think that will keep things simple.<br>
><br>
>Hmmm? What does the execbuf2 limit have to do with how many
engines the <br>
>hardware can have? I suggest not doing that.<br>
><br>
>The change which added this:<br>
><br>
> if (set.num_engines > I915_EXEC_RING_MASK + 1)<br>
> return -EINVAL;<br>
><br>
>to context creation needs to be undone, so users can
create engine <br>
>maps with all hardware engines, and execbuf3 can access
them all.<br>
><br>
<br>
The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to
execbuf3 also.<br>
Hence, I was using the same limit for VM_BIND queues (64, or
65 if we<br>
make it N+1).<br>
But, as discussed in another thread of this RFC series, we are
planning<br>
to drop I915_EXEC_RING_MASK in execbuf3. So, there
won't be<br>
any uapi that limits the number of engines (and hence the
number of<br>
vm_bind queues that need to be supported).<br>
<br>
If we allow the number of vm_bind queues to be arbitrarily
large<br>
(__u32 queue_idx), then we need a hashmap for queue
(a wq,<br>
work_item and a linked list) lookup from the user-specified
queue index.<br>
The other option is to put some hard limit (say 64 or 65)
and use<br>
an array of queues in the VM (each created upon first use). I
prefer the latter.<br>
</blockquote>
<div><br>
</div>
<div>I don't get why a VM_BIND queue is any different from any
other queue or userspace-visible kernel object. But I'll
leave those details up to danvet or whoever else might be
reviewing the implementation.</div>
<div><br>
</div>
<div>--Jason</div>
</div>
</div>
</blockquote>
<p><br>
</p>
<p>I kind of agree here. Wouldn't it be simpler to have the bind queue
created like the others when we build the engine map?</p>
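<p>Purely as a sketch of what I mean (I915_ENGINE_CLASS_VM_BIND is
hypothetical, no such engine class exists today; the rest is the
existing engines uapi):</p>
<pre>
/* Build an engine map that also contains a bind queue, reusing the
 * existing context engines extension. */
I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 2) = {
        .engines = {
                { .engine_class = I915_ENGINE_CLASS_RENDER,
                  .engine_instance = 0 },
                { .engine_class = I915_ENGINE_CLASS_VM_BIND, /* hypothetical */
                  .engine_instance = 0 },
        },
};

struct drm_i915_gem_context_param param = {
        .ctx_id = ctx_id,
        .param  = I915_CONTEXT_PARAM_ENGINES,
        .size   = sizeof(engines),
        .value  = (uintptr_t)&amp;engines,
};
ioctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &amp;param);
</pre>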
<p>For userspace it's then just a matter of selecting the right queue
ID when submitting.</p>
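<p>Something along these lines (again just a sketch; the queue field
is the one being discussed above and the final struct layout is
whatever the RFC ends up with):</p>
<pre>
/* Submit a bind on a specific bind queue.  The queue index would
 * select one of the per-VM bind queues (or an entry in the engine
 * map, as suggested above). */
struct drm_i915_gem_vm_bind bind = {
        .vm_id  = vm_id,
        .handle = bo_handle,
        .start  = gpu_va,     /* VA to bind at */
        .offset = 0,          /* offset into the BO */
        .length = size,
        .queue  = bind_queue, /* proposed: which bind queue to use */
};
ioctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &amp;bind);
</pre>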
<p>If there is ever a possibility of having this work on the GPU, it
would all be ready.</p>
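<p>And if the number of queues ends up capped and advertised the way
Niranjana suggests above (bits[0] = HAS_VM_BIND, bits[1-3] = 'n'
meaning 2^n queues), discovery on the userspace side is trivial
(sketch; the param and its encoding are still only a proposal):</p>
<pre>
int value = 0;
struct drm_i915_getparam gp = {
        .param = I915_PARAM_HAS_VM_BIND, /* proposed, not upstream */
        .value = &amp;value,
};
if (ioctl(fd, DRM_IOCTL_I915_GETPARAM, &amp;gp) == 0 &amp;&amp; (value &amp; 1)) {
        /* bits[1-3] encode 'n', giving 2^n bind queues */
        unsigned int num_bind_queues = 1u &lt;&lt; ((value &gt;&gt; 1) &amp; 0x7);
}
</pre>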
<p><br>
</p>
<p>Thanks,<br>
</p>
<p><br>
</p>
<p>-Lionel<br>
</p>
<p><br>
</p>
<blockquote type="cite"
cite="mid:CAOFGe97UDd2S+LdKeOWubFvc4cNy6KbRTtCPKUbwd8PnZPuvMQ@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<br>
Niranjana<br>
<br>
>Regards,<br>
><br>
>Tvrtko<br>
><br>
>><br>
>>Niranjana<br>
>><br>
>>><br>
>>>> I am trying to see how many queues we
need and don't want it to be<br>
>>>> arbitrarily<br>
>>>> large and unduly blow up memory usage
and complexity in the i915 driver.<br>
>>>><br>
>>>> I expect a Vulkan driver to use at most 2
in the vast majority <br>
>>>>of cases. I<br>
>>>> could imagine a client wanting to create
more than 1 sparse <br>
>>>>queue in which<br>
>>>> case it'll be N+1, but that's unlikely. As
far as complexity <br>
>>>>goes, once<br>
>>>> you allow two, I don't think the complexity
is going up by <br>
>>>>allowing N. As<br>
>>>> for memory usage, creating more queues
means more memory. That's a<br>
>>>> trade-off that userspace can make. Again,
the expected number <br>
>>>>here is 1<br>
>>>> or 2 in the vast majority of cases so I
don't think you need to worry.<br>
>>><br>
>>>Ok, will start with n=3 meaning 8 queues.<br>
>>>That would require us to create 8 workqueues.<br>
>>>We can change 'n' later if required.<br>
>>><br>
>>>Niranjana<br>
>>><br>
>>>><br>
>>>> > Why? Because Vulkan has two
basic kinds of bind <br>
>>>>operations and we<br>
>>>> don't<br>
>>>> > want any dependencies between
them:<br>
>>>> > 1. Immediate. These happen
right after BO creation or <br>
>>>>maybe as<br>
>>>> part of<br>
>>>> > vkBindImageMemory() or
vkBindBufferMemory(). These <br>
>>>>don't happen<br>
>>>> on a<br>
>>>> > queue and we don't want them
serialized with anything. To<br>
>>>> synchronize<br>
>>>> > with submit, we'll have a
syncobj in the VkDevice which is<br>
>>>> signaled by<br>
>>>> > all immediate bind operations
and make submits wait on it.<br>
>>>> > 2. Queued (sparse): These
happen on a VkQueue which may be the<br>
>>>> same as<br>
>>>> > a render/compute queue or may be
its own queue. It's up to us<br>
>>>> what we<br>
>>>> > want to advertise. From the
Vulkan API PoV, this is like any<br>
>>>> other<br>
>>>> > queue. Operations on it wait on
and signal semaphores. If we<br>
>>>> have a<br>
>>>> > VM_BIND engine, we'd provide
syncobjs to wait and <br>
>>>>signal just like<br>
>>>> we do<br>
>>>> > in execbuf().<br>
>>>> > The important thing is that we
don't want one type of <br>
>>>>operation to<br>
>>>> block<br>
>>>> > on the other. If immediate
binds are blocking on sparse binds,<br>
>>>> it's<br>
>>>> > going to cause
over-synchronization issues.<br>
>>>> > In terms of the internal
implementation, I know that <br>
>>>>there's going<br>
>>>> to be<br>
>>>> > a lock on the VM and that we
can't actually do these things in<br>
>>>> > parallel. That's fine. Once
the dma_fences have signaled and<br>
>>>> we're<br>
>>>><br>
>>>> That's correct. It is like a single
VM_BIND engine with <br>
>>>>multiple queues<br>
>>>> feeding into it.<br>
>>>><br>
>>>> Right. As long as the queues themselves
are independent and <br>
>>>>can block on<br>
>>>> dma_fences without holding up other queues,
I think we're fine.<br>
>>>><br>
>>>> > unblocked to do the bind
operation, I don't care if <br>
>>>>there's a bit<br>
>>>> of<br>
>>>> > synchronization due to locking.
That's expected. What <br>
>>>>we can't<br>
>>>> afford<br>
>>>> > to have is an immediate bind
operation suddenly blocking on a<br>
>>>> sparse<br>
>>>> > operation which is blocked on a
compute job that's going to run<br>
>>>> for<br>
>>>> > another 5ms.<br>
>>>><br>
>>>> As the VM_BIND queue is per VM, VM_BIND
on one VM doesn't block the<br>
>>>> VM_BIND<br>
>>>> on other VMs. I am not sure about
usecases here, but just wanted to<br>
>>>> clarify.<br>
>>>><br>
>>>> Yes, that's what I would expect.<br>
>>>> --Jason<br>
>>>><br>
>>>> Niranjana<br>
>>>><br>
>>>> > For reference, Windows solves
this by allowing arbitrarily many<br>
>>>> paging<br>
>>>> > queues (what they call a VM_BIND
engine/queue). That <br>
>>>>design works<br>
>>>> > pretty well and solves the
problems in question. <br>
>>>>Again, we could<br>
>>>> just<br>
>>>> > make everything out-of-order and
require using syncobjs <br>
>>>>to order<br>
>>>> things<br>
>>>> > as userspace wants. That'd be
fine too.<br>
>>>> > One more note while I'm here:
danvet said something on <br>
>>>>IRC about<br>
>>>> VM_BIND<br>
>>>> > queues waiting for syncobjs to
materialize. We don't really<br>
>>>> want/need<br>
>>>> > this. We already have all the
machinery in userspace to handle<br>
>>>> > wait-before-signal and waiting
for syncobj fences to <br>
>>>>materialize<br>
>>>> and<br>
>>>> > that machinery is on by
default. It would actually <br>
>>>>take MORE work<br>
>>>> in<br>
>>>> > Mesa to turn it off and take
advantage of the kernel <br>
>>>>being able to<br>
>>>> wait<br>
>>>> > for syncobjs to materialize.
Also, getting that right is<br>
>>>> ridiculously<br>
>>>> > hard and I really don't want to
get it wrong in kernel <br>
>>>>space. When we<br>
>>>> > do memory fences,
wait-before-signal will be a thing. We don't<br>
>>>> need to<br>
>>>> > try and make it a thing for
syncobj.<br>
>>>> > --Jason<br>
>>>> ><br>
>>>> > Thanks Jason,<br>
>>>> ><br>
>>>> > I missed the bit in the Vulkan
spec that we're allowed to have a<br>
>>>> sparse<br>
>>>> > queue that does not implement
either graphics or compute <br>
>>>>operations:<br>
>>>> ><br>
>>>> > "While some implementations may
include<br>
>>>> VK_QUEUE_SPARSE_BINDING_BIT<br>
>>>> > support in queue families that
also include<br>
>>>> ><br>
>>>> > graphics and compute support,
other implementations may only<br>
>>>> expose a<br>
>>>> > VK_QUEUE_SPARSE_BINDING_BIT-only
queue<br>
>>>> ><br>
>>>> > family."<br>
>>>> ><br>
>>>> > So it can all be a vm_bind
engine that just does bind/unbind<br>
>>>> > operations.<br>
>>>> ><br>
>>>> > But yes, we need another engine for
the immediate/non-sparse<br>
>>>> operations.<br>
>>>> ><br>
>>>> > -Lionel<br>
>>>> ><br>
>>>> > ><br>
>>>> > Daniel, any thoughts?<br>
>>>> ><br>
>>>> > Niranjana<br>
>>>> ><br>
>>>> > >Matt<br>
>>>> > ><br>
>>>> > >><br>
>>>> > >> Sorry I noticed this
late.<br>
>>>> > >><br>
>>>> > >><br>
>>>> > >> -Lionel<br>
>>>> > >><br>
>>>> > >><br>
</blockquote>
</div>
</div>
</blockquote>
<p><br>
</p>
</body>
</html>