This is the i915 driver VM_BIND feature design RFC patch series, along with the required uapi definition and a description of the intended use cases.
v2: Updated design and uapi, more documentation.
v3: Add more documentation and proper kernel-doc formatting with cross
    references (including missing i915_drm uapi kernel-docs which are
    required), as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Niranjana Vishwanathapura (3):
  drm/doc/rfc: VM_BIND feature design document
  drm/i915: Update i915 uapi documentation
  drm/doc/rfc: VM_BIND uapi definition
 Documentation/driver-api/dma-buf.rst   |   2 +
 Documentation/gpu/rfc/i915_vm_bind.h   | 399 +++++++++++++++++++++++++
 Documentation/gpu/rfc/i915_vm_bind.rst | 304 +++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 include/uapi/drm/i915_drm.h            | 153 +++++++---
 5 files changed, 825 insertions(+), 37 deletions(-)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
VM_BIND design document with description of intended use cases.
v2: Add more documentation and format as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 Documentation/driver-api/dma-buf.rst   |   2 +
 Documentation/gpu/rfc/i915_vm_bind.rst | 304 +++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 3 files changed, 310 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 36a76cbe9095..64cb924ec5bb 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
 .. kernel-doc:: include/linux/sync_file.h
    :internal:
+.. _indefinite_dma_fences:
+
 Indefinite DMA Fences
 ~~~~~~~~~~~~~~~~~~~~~
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
new file mode 100644
index 000000000000..f1be560d313c
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
@@ -0,0 +1,304 @@
+==========================================
+I915 VM_BIND feature design and use cases
+==========================================
+
+VM_BIND feature
+================
+DRM_I915_GEM_VM_BIND/UNBIND ioctls allow the UMD to bind/unbind GEM buffer
+objects (BOs), or sections of a BO, at specified GPU virtual addresses on a
+specified address space (VM). These mappings (also referred to as persistent
+mappings) will be persistent across multiple GPU submissions (execbuff calls)
+issued by the UMD, without the user having to provide a list of all required
+mappings during each submission (as required by the older execbuff mode).
+
+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
+to specify how the binding/unbinding should sync with other operations
+like the GPU job submission. These fences will be timeline 'drm_syncobj's
+for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
+For Compute contexts, they will be user/memory fences (See struct
+drm_i915_vm_bind_ext_user_fence).
+
+The VM_BIND feature is advertised to the user via I915_PARAM_HAS_VM_BIND.
+The user has to opt in to the VM_BIND mode of binding for an address space
+(VM) at VM creation time via the I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
+
+The VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping
+in an async worker. The binding and unbinding will work like a special GPU
+engine. The binding and unbinding operations are serialized and will wait on
+specified input fences before the operation and will signal the output fences
+upon the completion of the operation. Due to serialization, completion of an
+operation will also indicate that all previous operations are complete.
+
+VM_BIND features include:
+
+* Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+* A VA mapping can map to a partial section of the BO (partial binding).
+* Support capture of persistent mappings in the dump upon GPU error.
+* The TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  use cases will be helpful.
+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
+* Support for userptr gem objects (no special uapi is required for this).
+
+Execbuff ioctl in VM_BIND mode
+-------------------------------
+The execbuff ioctl handling in VM_BIND mode differs significantly from the
+older method. A VM in VM_BIND mode will not support the older execbuff mode
+of binding. In VM_BIND mode, the execbuff ioctl will not accept any execlist.
+Hence, there is no support for implicit sync. It is expected that the below
+work will be able to support the requirements of object dependency setting in
+all use cases:
+
+"dma-buf: Add an API for exporting sync files"
+(https://lwn.net/Articles/859290/)
+
+This also means we need an execbuff extension to pass in the batch
+buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+
+If execlist support in the execbuff ioctl is deemed necessary for
+implicit sync in certain use cases, then support can be added later.
+
+In VM_BIND mode, VA allocation is completely managed by the user instead of
+the i915 driver. Hence VA assignment and eviction are not applicable in
+VM_BIND mode.
+Also, for determining object activeness, VM_BIND mode will not
+be using the i915_vma active reference tracking. It will instead use the
+dma-resv object for that (See `VM_BIND dma_resv usage`_).
+
+So, a lot of existing code in the execbuff path, like relocations, VA
+evictions, the vma lookup table, implicit sync, vma active reference tracking
+etc., is not applicable in VM_BIND mode. Hence, the execbuff path needs to be
+cleaned up by clearly separating out the functionalities where the VM_BIND
+mode differs from the older method, and they should be moved to separate
+files.
+
+VM_PRIVATE objects
+-------------------
+By default, BOs can be mapped on multiple VMs and can also be dma-buf
+exported. Hence these BOs are referred to as Shared BOs.
+During each execbuff submission, the request fence must be added to the
+dma-resv fence list of all shared BOs mapped on the VM.
+
+The VM_BIND feature introduces an optimization where the user can create a BO
+which is private to a specified VM via the I915_GEM_CREATE_EXT_VM_PRIVATE
+flag during BO creation. Unlike Shared BOs, these VM private BOs can only be
+mapped on the VM they are private to and can't be dma-buf exported.
+All private BOs of a VM share the dma-resv object. Hence during each execbuff
+submission, they need only one dma-resv fence list update. Thus, the fast
+path (where required mappings are already bound) submission latency is O(1)
+w.r.t the number of VM private BOs.
+
+VM_BIND locking hierarchy
+--------------------------
+The locking design here supports the older (execlist based) execbuff mode, the
+newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
+system allocator support (See `Shared Virtual Memory (SVM) support`_).
+The older execbuff mode and the newer VM_BIND mode without page faults manage
+residency of backing storage using dma_fence. The VM_BIND mode with page
+faults and the system allocator support do not use any dma_fence at all.
+
+The VM_BIND locking order is as below.
+
+1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
+   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
+   mapping.
+
+   In future, when GPU page faults are supported, we can potentially use a
+   rwsem instead, so that multiple page fault handlers can take the read side
+   lock to look up the mapping and hence can run in parallel.
+   The older execbuff mode of binding does not need this lock.
+
+2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
+   be held while binding/unbinding a vma in the async worker and while
+   updating the dma-resv fence list of an object. Note that private BOs of a
+   VM will all share a dma-resv object.
+
+   The future system allocator support will use the HMM prescribed locking
+   instead.
+
+3) Lock-C: Spinlock/s to protect some of the VM's lists, like the list of
+   invalidated vmas (due to eviction and userptr invalidation) etc.
+
+When GPU page faults are supported, the execbuff path does not take any of
+these locks. There we will simply smash the new batch buffer address into the
+ring and then tell the scheduler to run that. The lock taking only happens
+from the page fault handler, where we take lock-A in read mode, whichever
+lock-B we need to find the backing storage (dma_resv lock for gem objects,
+and hmm/core mm for the system allocator) and some additional locks (lock-D)
+for taking care of page table races. Page fault mode should not need to ever
+manipulate the vm lists, so it won't ever need lock-C.
+
+VM_BIND LRU handling
+---------------------
+We need to ensure that VM_BIND mapped objects are properly LRU tagged to avoid
+performance degradation. We will also need support for bulk LRU movement of
+VM_BIND objects to avoid additional latencies in the execbuff path.
+
+The page table pages are similar to VM_BIND mapped objects (See
+`Evictable page table allocations`_); they are maintained per VM and need to
+be pinned in memory when the VM is made active (i.e., upon an execbuff call
+with that VM). So, bulk LRU movement of page table pages is also needed.
+
+The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
+over to the ttm LRU in some fashion to make sure we once again have a
+reasonable and consistent memory aging and reclaim architecture.
+
+VM_BIND dma_resv usage
+-----------------------
+Fences need to be added to all VM_BIND mapped objects. During each execbuff
+submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
+over-sync (See enum dma_resv_usage). One can override it with either
+DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
+setting (either through an explicit or an implicit mechanism).
+
+When vm_bind is called for a non-private object while the VM is already
+active, the fences need to be copied from the VM's shared dma-resv object
+(common to all private objects of the VM) to this non-private object.
+If this results in performance degradation, then some optimization will
+be needed here. This is not a problem for the VM's private objects as they
+use the shared dma-resv object which is always updated on each execbuff
+submission.
+
+Also, in VM_BIND mode, use the dma-resv apis for determining object activeness
+(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
+older i915_vma active reference tracking, which is deprecated. This should be
+easier to get working with the current TTM backend. We can remove the
+i915_vma active reference tracking fully while supporting the TTM backend for
+igfx.
+
+Evictable page table allocations
+---------------------------------
+Make page table allocations evictable and manage them similar to VM_BIND
+mapped objects. Page table pages are similar to persistent mappings of a
+VM (the differences here are that the page table pages will not have an
+i915_vma structure and that, after swapping pages back in, the parent page
+link needs to be updated).
+
+Mesa use case
+--------------
+VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and
+Iris), hence improving performance of CPU-bound applications. It also allows
+us to implement Vulkan's Sparse Resources. With increasing GPU hardware
+performance, reducing CPU overhead becomes more impactful.
+
+
+VM_BIND Compute support
+========================
+
+User/Memory Fence
+------------------
+The idea is to take a user specified virtual address and install an interrupt
+handler to wake up the current task when the memory location passes the user
+supplied filter. A User/Memory fence is an <address, value> pair. To signal
+the user fence, the specified value will be written at the specified virtual
+address and the waiting process will be woken up. The user can wait on a user
+fence with the gem_wait_user_fence ioctl.
+
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
+interrupt within their batches after updating the value, to have sub-batch
+precision on the wakeup. Each batch can signal a user fence to indicate
+the completion of the next level batch.
+The completion of the very first
+level batch needs to be signaled by the command streamer. The user must
+provide the user/memory fence for this via the
+DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE extension of the execbuff ioctl, so
+that the KMD can set up the command streamer to signal it.
+
+A User/Memory fence can also be supplied to the kernel driver to signal/wake
+up the user process after completion of an asynchronous operation.
+
+When the VM_BIND ioctl is provided with a user/memory fence via the
+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
+of binding of that mapping. All async binds/unbinds are serialized, hence
+signaling of the user/memory fence also indicates the completion of all
+previous binds/unbinds.
+
+This feature will be derived from the below original work:
+https://patchwork.freedesktop.org/patch/349417/
+
+Long running Compute contexts
+------------------------------
+Usage of dma-fence expects that it completes in a reasonable amount of time.
+Compute, on the other hand, can be long running. Hence it is appropriate for
+compute to use user/memory fences, and dma-fence usage will be limited to
+in-kernel consumption only. This requires an execbuff uapi extension to pass
+in a user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must
+opt in to this mechanism with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag
+during context creation. The dma-fence based user interfaces like the
+gem_wait ioctl and the execbuff out fence are not allowed on long running
+contexts. Implicit sync is not valid either and is anyway not supported in
+VM_BIND mode.
+
+Where GPU page faults are not available, the kernel driver, upon buffer
+invalidation, will initiate a suspend (preemption) of the long running
+context with a dma-fence attached to it. And upon completion of that suspend
+fence, it will finish the invalidation, revalidate the BO and then resume the
+compute context. This is done by having a per-context preempt fence (also
+called suspend fence) proxying as the i915_request fence. This suspend fence
+is enabled when someone tries to wait on it, which then triggers the context
+preemption.
+
+As this support for context suspension using a preempt fence and the resume
+work for the compute mode contexts can be tricky to get right, it is better
+to add this support in the drm scheduler so that multiple drivers can make
+use of it. That means it will have a dependency on the i915 drm scheduler
+conversion with the GuC scheduler backend. This should be fine, as the plan
+is to support compute mode contexts only with the GuC scheduler backend (at
+least initially). This is much easier to support with VM_BIND mode compared
+to the current heavier execbuff path resource attachment.
+
+Low Latency Submission
+-----------------------
+Allows the compute UMD to directly submit GPU jobs instead of going through
+the execbuff ioctl. This is made possible because VM_BIND is not synchronized
+against execbuff. VM_BIND allows bind/unbind of the mappings required for the
+directly submitted jobs.
+
+Other VM_BIND use cases
+========================
+
+Debugger
+---------
+With the debug event interface, a user space process (debugger) is able to
+keep track of and act upon resources created by another process (the debugged
+process) and attached to the GPU via the vm_bind interface.
+
+GPU page faults
+----------------
+GPU page faults, when supported (in future), will only be supported in
+VM_BIND mode.
+While both the older execbuff mode and the newer VM_BIND mode of
+binding will require using dma-fence to ensure residency, the GPU page faults
+mode, when supported, will not use any dma-fence, as residency is purely
+managed by installing and removing/invalidating page table entries.
+
+Page level hints settings
+--------------------------
+VM_BIND allows any hints setting per mapping instead of per BO.
+Possible hints include read-only mapping, placement and atomicity.
+Sub-BO level placement hints will be even more relevant with
+upcoming GPU on-demand page fault support.
+
+Page level Cache/CLOS settings
+-------------------------------
+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+
+Shared Virtual Memory (SVM) support
+------------------------------------
+The VM_BIND interface can be used to map system memory directly (without the
+gem BO abstraction) using the HMM interface. SVM is only supported with GPU
+page faults enabled.
+
+
+Broader i915 cleanups
+=====================
+Supporting this whole new vm_bind mode of binding, which comes with its own
+use cases to support and its locking requirements, requires proper
+integration with the existing i915 driver. This calls for some broader i915
+driver cleanups/simplifications for maintainability of the driver going
+forward. Here are a few things that have been identified and are being looked
+into.
+
+- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
+  feature does not use it, and the complexity it brings in is probably more
+  than the performance advantage we get in the legacy execbuff case.
+- Remove vma->open_count counting.
+- Remove i915_vma active reference tracking. The VM_BIND feature will not be
+  using it. Instead, use the underlying BO's dma-resv fence list to determine
+  whether an i915_vma is active or not.
+
+
+VM_BIND UAPI
+=============
+
+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..7d10c36b268d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
     i915_scheduler.rst
+
+.. toctree::
+
+    i915_vm_bind.rst
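For readers new to the proposed interface, a rough userspace sketch of the opt-in flow described in the document is shown below. The VM_BIND symbols (I915_PARAM_HAS_VM_BIND, I915_VM_CREATE_FLAGS_USE_VM_BIND, struct drm_i915_gem_vm_bind, DRM_IOCTL_I915_GEM_VM_BIND) are taken from the RFC header in this series and are not final uapi, so treat this as an illustration of the intended flow under those assumptions rather than working code; only the getparam, vm_create and create_ext pieces are existing i915 uapi.

/*
 * Sketch only. The VM_BIND names below come from the RFC header
 * (Documentation/gpu/rfc/i915_vm_bind.h); struct layouts and ioctl
 * numbers may change before anything is merged.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>
#include "i915_vm_bind.h"            /* RFC uapi draft */

static int vm_bind_example(int drm_fd, uint32_t bo_handle,
                           uint64_t gpu_va, uint64_t size)
{
    int has_vm_bind = 0;
    struct drm_i915_getparam gp = { .param = I915_PARAM_HAS_VM_BIND };

    gp.value = &has_vm_bind;
    if (ioctl(drm_fd, DRM_IOCTL_I915_GETPARAM, &gp) || !has_vm_bind)
        return -1;                   /* kernel does not advertise VM_BIND */

    /* Opt in to VM_BIND mode at VM creation time. */
    struct drm_i915_gem_vm_control vm_create = {
        .flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,      /* RFC flag */
    };
    if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm_create))
        return -1;

    /* Bind the whole BO at a UMD-chosen GPU virtual address. */
    struct drm_i915_gem_vm_bind bind = {                /* RFC struct */
        .vm_id  = vm_create.vm_id,
        .handle = bo_handle,
        .start  = gpu_va,            /* VA is managed entirely by the UMD */
        .offset = 0,                 /* non-zero for partial binds */
        .length = size,
    };
    return ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);
}

The corresponding unbind would pass the same VA range to the VM_UNBIND ioctl, and in/out fences would be chained through the extensions field as described above.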
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND design document with description of intended use cases.
v2: Add more documentation and format as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst new file mode 100644 index 000000000000..f1be560d313c --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.rst @@ -0,0 +1,304 @@ +========================================== +I915 VM_BIND feature design and use cases +==========================================
+VM_BIND feature +================ +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a +specified address space (VM). These mappings (also referred to as persistent +mappings) will be persistent across multiple GPU submissions (execbuff calls) +issued by the UMD, without user having to provide a list of all required +mappings during each submission (as required by older execbuff mode).
+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace +to specify how the binding/unbinding should sync with other operations +like the GPU job submission. These fences will be timeline 'drm_syncobj's +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences). +For Compute contexts, they will be user/memory fences (See struct +drm_i915_vm_bind_ext_user_fence).
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND. +User has to opt-in for VM_BIND mode of binding for an address space (VM) +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
+VM_BIND features include:
+* Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+* VA mapping can map to a partial section of the BO (partial binding).
+* Support capture of persistent mappings in the dump upon GPU error.
+* TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  use cases will be helpful.
+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
+* Support for userptr gem objects (no special uapi is required for this).
+Execbuff ioctl in VM_BIND mode +------------------------------- +The execbuff ioctl handling in VM_BIND mode differs significantly from the +older method. A VM in VM_BIND mode will not support older execbuff mode of +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence, +no support for implicit sync. It is expected that the below work will be able +to support requirements of object dependency setting in all use cases:
+"dma-buf: Add an API for exporting sync files" +(https://lwn.net/Articles/859290/)
I would really like to have more details here. The link provided points to new ioctls and we're not very familiar with those yet, so I think you should really clarify the interaction between the new additions here. Having some sample code would be really nice too.
For Mesa at least (and I believe for the other drivers too) we always have a few exported buffers in every execbuf call, and we rely on the implicit synchronization provided by execbuf to make sure everything works. The execbuf ioctl also has some code to flush caches during implicit synchronization AFAIR, so I would guess we rely on it too and whatever else the Kernel does. Is that covered by the new ioctls?
In addition, as far as I remember, one of the big improvements of vm_bind was that it would help reduce ioctl latency and cpu overhead. But if making execbuf faster comes at the cost of requiring additional ioctl calls for implicit synchronization, which is required on every execbuf call, then I wonder if we'll even get any faster at all. Comparing old execbuf vs plain new execbuf without the new required ioctls won't make sense.
But maybe I'm wrong and we won't need to call these new ioctls around every single execbuf ioctl we submit? Again, more clarification and some code examples here would be really nice. This is a big change to an important part of the API, so we should clarify the new expected usage.
+This also means, we need an execbuff extension to pass in the batch +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+If at all execlist support in execbuff ioctl is deemed necessary for +implicit sync in certain use cases, then support can be added later.
IMHO we really need to sort this out and check all the assumptions before we commit to any interface. Again, implicit synchronization is something we rely on during *every* execbuf ioctl for most workloads.
+In VM_BIND mode, VA allocation is completely managed by the user instead of +the i915 driver. Hence all VA assignment, eviction are not applicable in +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not +be using the i915_vma active reference tracking. It will instead use dma-resv +object for that (See `VM_BIND dma_resv usage`_).
+So, a lot of existing code in the execbuff path like relocations, VA evictions, +vma lookup table, implicit sync, vma active reference tracking etc., are not +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up +by clearly separating out the functionalities where the VM_BIND mode differs +from older method and they should be moved to separate files.
I seem to recall some conversations where we were told a bunch of ioctls would stop working or make no sense to call when using vm_bind. Can we please get a complete list of those? Bonus points if the Kernel starts telling us we just called something that makes no sense.
+VM_PRIVATE objects +------------------- +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO which +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on +the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each execbuff +submission, they need only one dma-resv fence list updated. Thus, the fast +path (where required mappings are already bound) submission latency is O(1) +w.r.t the number of VM private BOs.
I know we already discussed this, but just to document it publicly: the ideal case for user space would be that every BO is created as private but then we'd have an ioctl to convert it to non-private (without the need to have a non-private->private interface).
An explanation on why we can't have an ioctl to mark as exported a buffer that was previously vm_private would be really appreciated.
Thanks, Paulo
On Thu, May 19, 2022 at 03:52:01PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND design document with description of intended use cases.
v2: Add more documentation and format as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst new file mode 100644 index 000000000000..f1be560d313c --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.rst @@ -0,0 +1,304 @@ +========================================== +I915 VM_BIND feature design and use cases +==========================================
+VM_BIND feature +================ +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a +specified address space (VM). These mappings (also referred to as persistent +mappings) will be persistent across multiple GPU submissions (execbuff calls) +issued by the UMD, without user having to provide a list of all required +mappings during each submission (as required by older execbuff mode).
+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace +to specify how the binding/unbinding should sync with other operations +like the GPU job submission. These fences will be timeline 'drm_syncobj's +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences). +For Compute contexts, they will be user/memory fences (See struct +drm_i915_vm_bind_ext_user_fence).
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND. +User has to opt-in for VM_BIND mode of binding for an address space (VM) +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
+VM_BIND features include:
+* Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+* VA mapping can map to a partial section of the BO (partial binding).
+* Support capture of persistent mappings in the dump upon GPU error.
+* TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  use cases will be helpful.
+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
+* Support for userptr gem objects (no special uapi is required for this).
+Execbuff ioctl in VM_BIND mode +------------------------------- +The execbuff ioctl handling in VM_BIND mode differs significantly from the +older method. A VM in VM_BIND mode will not support older execbuff mode of +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence, +no support for implicit sync. It is expected that the below work will be able +to support requirements of object dependency setting in all use cases:
+"dma-buf: Add an API for exporting sync files" +(https://lwn.net/Articles/859290/)
I would really like to have more details here. The link provided points to new ioctls and we're not very familiar with those yet, so I think you should really clarify the interaction between the new additions here. Having some sample code would be really nice too.
For Mesa at least (and I believe for the other drivers too) we always have a few exported buffers in every execbuf call, and we rely on the implicit synchronization provided by execbuf to make sure everything works. The execbuf ioctl also has some code to flush caches during implicit synchronization AFAIR, so I would guess we rely on it too and whatever else the Kernel does. Is that covered by the new ioctls?
In addition, as far as I remember, one of the big improvements of vm_bind was that it would help reduce ioctl latency and cpu overhead. But if making execbuf faster comes at the cost of requiring additional ioctls calls for implicit synchronization, which is required on ever execbuf call, then I wonder if we'll even get any faster at all. Comparing old execbuf vs plain new execbuf without the new required ioctls won't make sense.
But maybe I'm wrong and we won't need to call these new ioctls around every single execbuf ioctl we submit? Again, more clarification and some code examples here would be really nice. This is a big change on an important part of the API, we should clarify the new expected usage.
Thanks Paulo for the comments.
In VM_BIND mode, the only reason we would need execlist support in the execbuff path is for implicit synchronization. And AFAIK, this work from Jason is expected to replace implicit synchronization with new ioctls. Hence, VM_BIND mode will not need execlist support at all.
Based on comments from Daniel and my offline sync with Jason, this new mechanism from Jason is expected to work for vl. For gl, there is a question of whether it will be performant or not, but it is worth trying that first. If it is not performant for gl, only then can we consider adding implicit sync support back for VM_BIND mode.
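To illustrate what that would look like for a UMD (assuming the dma-buf sync-file ioctls land roughly as proposed in Jason's series; DMA_BUF_IOCTL_EXPORT_SYNC_FILE / DMA_BUF_IOCTL_IMPORT_SYNC_FILE, their structs and flags below are taken from that in-flight work and are assumptions here), the per-shared-BO pattern would be roughly:

/*
 * Sketch only: bridging implicit sync on a shared dma-buf to the explicit
 * in/out fences of a VM_BIND mode execbuf.
 */
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>
#include <drm/i915_drm.h>

static int submit_with_shared_bo(int drm_fd, int dmabuf_fd,
                                 struct drm_i915_gem_execbuffer2 *execbuf)
{
    /* 1) Pull the BO's current implicit fences out as a sync_file fd. */
    struct dma_buf_export_sync_file export = { .flags = DMA_BUF_SYNC_RW };
    if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &export))
        return -1;

    /* 2) Use it as the execbuf in-fence and request an out-fence. */
    execbuf->flags |= I915_EXEC_FENCE_IN | I915_EXEC_FENCE_OUT;
    execbuf->rsvd2 = (uint32_t)export.fd;          /* lower 32 bits: in */
    if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2_WR, execbuf)) {
        close(export.fd);
        return -1;
    }
    close(export.fd);

    /* 3) Feed the out-fence back into the BO so other implicit-sync users
     * (e.g. a compositor importing this dma-buf) still wait correctly.
     */
    struct dma_buf_import_sync_file import = {
        .flags = DMA_BUF_SYNC_WRITE,
        .fd    = (int)(execbuf->rsvd2 >> 32),      /* upper 32 bits: out */
    };
    int ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &import);
    close(import.fd);
    return ret;
}

How many of these per-BO round trips a typical frame needs is exactly the cost question Paulo raises above.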
Daniel, Jason, Ken, any thoughts you can add here?
+This also means, we need an execbuff extension to pass in the batch +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+If at all execlist support in execbuff ioctl is deemed necessary for +implicit sync in certain use cases, then support can be added later.
IMHO we really need to sort this and check all the assumptions before we commit to any interface. Again, implicit synchronization is something we rely on during *every* execbuf ioctl for most workloads.
Daniel's earlier feedback was that it is worth Mesa trying this new mechanism for gl and seeing if it works. We want to avoid adding execlist support for implicit sync in vm_bind mode from the beginning if it is going to be deemed unnecessary.
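For completeness, a sketch of what a VM_BIND mode submission without an execlist might look like is below. struct drm_i915_gem_execbuffer_ext_batch_addresses, its field names and the extension define are from the RFC header and should be read as assumptions; only I915_EXEC_USE_EXTENSIONS and the execbuffer2 ioctl itself are existing uapi.

/*
 * Sketch only: VM_BIND mode execbuf with no execlist; the batch GPU VA is
 * passed via the RFC batch-address extension (names/fields not final).
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>
#include "i915_vm_bind.h"            /* RFC uapi draft */

static int submit_vm_bind_mode(int drm_fd, uint32_t ctx_id, uint64_t batch_va)
{
    uint64_t batch_addresses[] = { batch_va };

    struct drm_i915_gem_execbuffer_ext_batch_addresses batch_ext = {
        .base.name     = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES,
        .count         = 1,
        .batch_address = (uintptr_t)batch_addresses,
    };

    struct drm_i915_gem_execbuffer2 execbuf = {
        .buffers_ptr   = 0,
        .buffer_count  = 0,          /* no execlist in VM_BIND mode */
        .rsvd1         = ctx_id,
        .flags         = I915_EXEC_USE_EXTENSIONS,
        .cliprects_ptr = (uintptr_t)&batch_ext,   /* extension chain */
    };

    return ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}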
+In VM_BIND mode, VA allocation is completely managed by the user instead of +the i915 driver. Hence all VA assignment, eviction are not applicable in +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not +be using the i915_vma active reference tracking. It will instead use dma-resv +object for that (See `VM_BIND dma_resv usage`_).
+So, a lot of existing code in the execbuff path like relocations, VA evictions, +vma lookup table, implicit sync, vma active reference tracking etc., are not +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up +by clearly separating out the functionalities where the VM_BIND mode differs +from older method and they should be moved to separate files.
I seem to recall some conversations where we were told a bunch of ioctls would stop working or make no sense to call when using vm_bind. Can we please get a complete list of those? Bonus points if the Kernel starts telling us we just called something that makes no sense.
Which ioctls are you talking about here? We do not support the GEM_WAIT ioctl, but that is only for compute mode (which is already documented in this patch).
+VM_PRIVATE objects +------------------- +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO which +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on +the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each execbuff +submission, they need only one dma-resv fence list updated. Thus, the fast +path (where required mappings are already bound) submission latency is O(1) +w.r.t the number of VM private BOs.
I know we already discussed this, but just to document it publicly: the ideal case for user space would be that every BO is created as private but then we'd have an ioctl to convert it to non-private (without the need to have a non-private->private interface).
An explanation on why we can't have an ioctl to mark as exported a buffer that was previously vm_private would be really appreciated.
Ok, I can add some notes on that. The reason is that this would require changing the dma-resv object of the gem object, and hence the object locking as well. This will add complications, as we would have to sync with any pending operations. It might be easier for UMDs to do it themselves by copying the object contents to a new object.
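For illustration, the creation-time choice looks roughly like the sketch below (the VM-private extension struct and name are from the RFC header and may change; a shareable BO simply omits the extension).

/*
 * Sketch only: create a BO that is private to one VM. The VM-private
 * extension below is from the RFC header, not final uapi. A BO created
 * without this extension remains a Shared BO and can be dma-buf exported.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>
#include "i915_vm_bind.h"            /* RFC uapi draft */

static int create_vm_private_bo(int drm_fd, uint32_t vm_id, uint64_t size,
                                uint32_t *handle)
{
    struct drm_i915_gem_create_ext_vm_private vm_private = {
        .base.name = I915_GEM_CREATE_EXT_VM_PRIVATE,   /* RFC extension */
        .vm_id     = vm_id,   /* mappable only on this VM, not exportable */
    };
    struct drm_i915_gem_create_ext create = {
        .size       = size,
        .extensions = (uintptr_t)&vm_private,
    };
    int ret = ioctl(drm_fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);

    *handle = create.handle;
    return ret;
}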
Niranjana
On Mon, May 23, 2022 at 12:05:05PM -0700, Niranjana Vishwanathapura wrote:
On Thu, May 19, 2022 at 03:52:01PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
+Execbuff ioctl in VM_BIND mode +------------------------------- +The execbuff ioctl handling in VM_BIND mode differs significantly from the +older method. A VM in VM_BIND mode will not support older execbuff mode of +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence, +no support for implicit sync. It is expected that the below work will be able +to support requirements of object dependency setting in all use cases:
+"dma-buf: Add an API for exporting sync files" +(https://lwn.net/Articles/859290/)
I would really like to have more details here. The link provided points to new ioctls and we're not very familiar with those yet, so I think you should really clarify the interaction between the new additions here. Having some sample code would be really nice too.
For Mesa at least (and I believe for the other drivers too) we always have a few exported buffers in every execbuf call, and we rely on the implicit synchronization provided by execbuf to make sure everything works. The execbuf ioctl also has some code to flush caches during implicit synchronization AFAIR, so I would guess we rely on it too and whatever else the Kernel does. Is that covered by the new ioctls?
In addition, as far as I remember, one of the big improvements of vm_bind was that it would help reduce ioctl latency and cpu overhead. But if making execbuf faster comes at the cost of requiring additional ioctl calls for implicit synchronization, which is required on every execbuf call, then I wonder if we'll even get any faster at all. Comparing old execbuf vs plain new execbuf without the new required ioctls won't make sense.
But maybe I'm wrong and we won't need to call these new ioctls around every single execbuf ioctl we submit? Again, more clarification and some code examples here would be really nice. This is a big change on an important part of the API, we should clarify the new expected usage.
Thanks Paulo for the comments.
In VM_BIND mode, the only reason we would need execlist support in the execbuff path is for implicit synchronization. And AFAIK, this work from Jason is expected to replace implicit synchronization with new ioctls. Hence, VM_BIND mode will not be needing execlist support at all.
Based on comments from Daniel and my offline sync with Jason, this new mechanism from Jason is expected to work for vl. For gl, there is a question of whether it will be performant or not. But it is worth trying that first. If it is not performant for gl, only then can we consider adding implicit sync support back for VM_BIND mode.
Daniel, Jason, Ken, any thoughts you can add here?
CC'ing Ken.
+If at all execlist support in execbuff ioctl is deemed necessary for +implicit sync in certain use cases, then support can be added later.
IMHO we really need to sort this and check all the assumptions before we commit to any interface. Again, implicit synchronization is something we rely on during *every* execbuf ioctl for most workloads.
Daniel's earlier feedback was that it is worth Mesa trying this new mechanism for gl and seeing if that works. We want to avoid adding execlist support for implicit sync in vm_bind mode from the beginning if it is going to be deemed unnecessary.
+In VM_BIND mode, VA allocation is completely managed by the user instead of +the i915 driver. Hence all VA assignment, eviction are not applicable in +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not +be using the i915_vma active reference tracking. It will instead use dma-resv +object for that (See `VM_BIND dma_resv usage`_).
+So, a lot of existing code in the execbuff path like relocations, VA evictions, +vma lookup table, implicit sync, vma active reference tracking etc., are not +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up +by clearly separating out the functionalities where the VM_BIND mode differs +from older method and they should be moved to separate files.
I seem to recall some conversations where we were told a bunch of ioctls would stop working or make no sense to call when using vm_bind. Can we please get a complete list of those? Bonus points if the Kernel starts telling us we just called something that makes no sense.
Which ioctls are you talking about here? We do not support the GEM_WAIT ioctl, but that is only for compute mode (which is already documented in this patch).
+VM_PRIVATE objects +------------------- +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO which +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on +the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each execbuff +submission, they need only one dma-resv fence list updated. Thus, the fast +path (where required mappings are already bound) submission latency is O(1) +w.r.t the number of VM private BOs.
I know we already discussed this, but just to document it publicly: the ideal case for user space would be that every BO is created as private but then we'd have an ioctl to convert it to non-private (without the need to have a non-private->private interface).
An explanation on why we can't have an ioctl to mark as exported a buffer that was previously vm_private would be really appreciated.
Ok, I can add some notes on that. The reason is that this requires changing the dma-resv object of the gem object, and hence the object locking as well. This will add complications as we have to sync with any pending operations. It might be easier for UMDs to do it themselves by copying the object contents to a new object.
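For reference, creating a VM private BO with the proposed I915_GEM_CREATE_EXT_VM_PRIVATE extension could look roughly like the sketch below. Only drm_i915_gem_create_ext and i915_user_extension are existing i915_drm.h definitions; the extension struct layout and its name value here are illustrative guesses at the proposal in i915_vm_bind.h, not merged uapi.

/* Sketch only: I915_GEM_CREATE_EXT_VM_PRIVATE is part of this proposal, not
 * merged uapi; the extension layout and name value below are illustrative. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

struct create_ext_vm_private_sketch {
        struct i915_user_extension base;  /* .name = proposed VM_PRIVATE ext */
        uint32_t vm_id;                   /* VM this BO will be private to */
        uint32_t pad;
};

static int create_vm_private_bo(int drm_fd, uint32_t vm_id, uint64_t size,
                                uint32_t *handle)
{
        struct create_ext_vm_private_sketch ext = {
                .base.name = 2, /* hypothetical I915_GEM_CREATE_EXT_VM_PRIVATE */
                .vm_id = vm_id,
        };
        struct drm_i915_gem_create_ext create = {
                .size = size,
                .extensions = (uintptr_t)&ext,
        };

        if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create))
                return -1;

        /* The returned handle can only ever be vm_bound on vm_id and cannot be
         * dma-buf exported, which is what lets it share the VM's dma-resv. */
        *handle = create.handle;
        return 0;
}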
Niranjana
Thanks, Paulo
On 20/05/2022 01:52, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
+Execbuff ioctl in VM_BIND mode +------------------------------- +The execbuff ioctl handling in VM_BIND mode differs significantly from the +older method. A VM in VM_BIND mode will not support older execbuff mode of +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence, +no support for implicit sync. It is expected that the below work will be able +to support requirements of object dependency setting in all use cases:
+"dma-buf: Add an API for exporting sync files" +(https://lwn.net/Articles/859290/)
I would really like to have more details here. The link provided points to new ioctls and we're not very familiar with those yet, so I think you should really clarify the interaction between the new additions here. Having some sample code would be really nice too.
For Mesa at least (and I believe for the other drivers too) we always have a few exported buffers in every execbuf call, and we rely on the implicit synchronization provided by execbuf to make sure everything works. The execbuf ioctl also has some code to flush caches during implicit synchronization AFAIR, so I would guess we rely on it too and whatever else the Kernel does. Is that covered by the new ioctls?
In addition, as far as I remember, one of the big improvements of vm_bind was that it would help reduce ioctl latency and cpu overhead. But if making execbuf faster comes at the cost of requiring additional ioctl calls for implicit synchronization, which is required on every execbuf call, then I wonder if we'll even get any faster at all. Comparing old execbuf vs plain new execbuf without the new required ioctls won't make sense. But maybe I'm wrong and we won't need to call these new ioctls around every single execbuf ioctl we submit? Again, more clarification and some code examples here would be really nice. This is a big change on an important part of the API, we should clarify the new expected usage.
Hey Paulo,
I think in the case of X11/Wayland, we'll be doing 1 or 2 extra ioctls per frame which seems pretty reasonable.
Essentially we need to set the dependencies on the buffer we're going to tell the display engine (gnome-shell/kde/bare-display-hw) to use.
In the Vulkan case, we're trading building execbuffer lists of potentially thousands of buffers for every single submission versus 1 or 2 ioctls for a single item when doing vkQueuePresent() (which happens less often than we do execbuffer ioctls).
That seems like a good trade off and doesn't look like a lot more work than explicit fencing where we would have to send associated fences.
Here is the Mesa MR associated with this : https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
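To make the "1 or 2 extra ioctls per frame" concrete, here is a rough sketch assuming the export/import sync-file ioctls from the linked series land more or less as proposed; the struct and ioctl names below come from that proposal, not from anything merged today.

/* Rough sketch only: assumes the proposed DMA_BUF_IOCTL_EXPORT/IMPORT_SYNC_FILE
 * uapi from the "dma-buf: Add an API for exporting sync files" series. */
#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* After the last execbuf of the frame: turn the shared buffer's fences into a
 * sync file and hand it to the compositor/display along with the buffer. */
static int export_frame_fence(int dmabuf_fd)
{
        struct dma_buf_export_sync_file args = {
                .flags = DMA_BUF_SYNC_RW,
                .fd = -1,
        };

        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args))
                return -1;
        return args.fd;
}

/* The other direction: attach a fence we got from the compositor so that
 * implicit-sync consumers of the buffer wait for it. */
static int import_acquire_fence(int dmabuf_fd, int sync_file_fd)
{
        struct dma_buf_import_sync_file args = {
                .flags = DMA_BUF_SYNC_WRITE,
                .fd = sync_file_fd,
        };

        return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
}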
-Lionel
On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
And the fact that it's happening in an async worker seems to imply it's not immediate.
I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do:
VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)
In what order are the fences going to be signaled?
In the order of VM_BIND ioctls? Or out of order?
Because you wrote "serialized", I assume it's: in order.
One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification.
In Vulkan VM_BIND operations are serialized but per engine.
So you could have something like this :
VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
fence1 is not signaled
fence3 is signaled
So the second VM_BIND will proceed before the first VM_BIND.
I guess we can deal with that scenario in userspace by doing the wait ourselves in one thread per engine.
But then it makes the VM_BIND input fences useless.
Daniel: what do you think? Should we rework this or just deal with wait fences in userspace?
Sorry I noticed this late.
-Lionel
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
Daniel: what do you think? Should we rework this or just deal with wait fences in userspace?
My opinion is to rework this but make the ordering via an engine param optional.
e.g. A VM can be configured so all binds are ordered within the VM
e.g. A VM can be configured so all binds accept an engine argument (in the case of the i915 likely this is a gem context handle) and binds ordered with respect to that engine.
This gives UMDs options, as the latter likely consumes more KMD resources, so if a different UMD can live with binds being ordered within the VM they can use a mode consuming less resources.
Matt
On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
And the fact that it's happening in an async worker seem to imply it's not immediate.
Ok, will fix. This was added because in the earlier design binding was deferred until the next execbuff. But now it is non-deferred (immediate in that sense). But yeah, this is confusing and I will fix it.
I have a question on the behavior of the bind operation when no input fence is provided. Let say I do :
VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)
In what order are the fences going to be signaled?
In the order of VM_BIND ioctls? Or out of order?
Because you wrote "serialized", I assume it's: in order.
Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will use the same queue and hence are ordered.
My opinion is rework this but make the ordering via an engine param optional.
e.g. A VM can be configured so all binds are ordered within the VM
e.g. A VM can be configured so all binds accept an engine argument (in the case of the i915 likely this is a gem context handle) and binds ordered with respect to that engine.
This gives UMDs options as the later likely consumes more KMD resources so if a different UMD can live with binds being ordered within the VM they can use a mode consuming less resources.
I think we need to be careful here if we are looking for some out of (submission) order completion of vm_bind/unbind. In-order completion means, in a batch of binds and unbinds to be completed in-order, the user only needs to specify an in-fence for the first bind/unbind call and the out-fence for the last bind/unbind call. Also, the VA released by an unbind call can be re-used by any subsequent bind call in that in-order batch.
These things will break if binding/unbinding were allowed to go out of order (of submission), and the user would need to be extra careful not to run into premature triggering of the out-fence, binds failing because the VA is still in use, etc.
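To make the in-order contract concrete, here is a small sketch; vm_bind()/vm_unbind() are hypothetical userspace wrappers around the proposed ioctls (not real uapi), with the proposed in/out fences as the last two arguments (0 meaning none).

/* Illustrative only: vm_bind()/vm_unbind() are hypothetical wrappers around
 * the proposed VM_BIND/UNBIND ioctls. */
#include <stdint.h>

extern void vm_bind(int vm, int bo, uint64_t va, uint32_t in_fence, uint32_t out_fence);
extern void vm_unbind(int vm, uint64_t va, uint32_t in_fence, uint32_t out_fence);

static void bind_batch_in_order(int vm, int bo1, int bo2, int bo3,
                                uint64_t va1, uint64_t va2, uint64_t old_va,
                                uint32_t job_done_fence, uint32_t all_bound_fence)
{
        vm_bind(vm, bo1, va1, job_done_fence, 0);     /* only the first op needs an in-fence */
        vm_bind(vm, bo2, va2, 0, 0);
        vm_unbind(vm, old_va, 0, 0);                  /* releases old_va */
        vm_bind(vm, bo3, old_va, 0, all_bound_fence); /* safely re-uses old_va */
        /*
         * Completion is in submission order, so all_bound_fence signaling
         * implies all four operations above have completed.
         */
}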
Also, VM_BIND binds the provided mapping on the specified address space (VM). So, the uapi is not engine/context specific.
We can however add a 'queue' to the uapi which can be one from the pre-defined queues, I915_VM_BIND_QUEUE_0 I915_VM_BIND_QUEUE_1 ... I915_VM_BIND_QUEUE_(N-1)
KMD will spawn an async work queue for each queue, which will only bind the mappings on that queue in the order of submission. The user can assign a queue per engine or anything like that.
But again, here the user needs to be careful not to deadlock these queues with circular dependencies of fences.
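Purely to illustrate that proposal (the queue argument and the I915_VM_BIND_QUEUE_* names are not uapi today, and the wrapper below is hypothetical), using Lionel's rcs0/ccs0 example:

/* Hypothetical sketch of the proposed per-VM bind queues. */
#include <stdint.h>

extern void vm_bind_on_queue(int vm, uint32_t queue, int bo, uint64_t va,
                             uint32_t in_fence, uint32_t out_fence);

#define BIND_QUEUE_RCS0 0  /* e.g. I915_VM_BIND_QUEUE_0, assignment chosen by the UMD */
#define BIND_QUEUE_CCS0 1  /* e.g. I915_VM_BIND_QUEUE_1 */

static void per_engine_binds(int vm, int bo1, int bo2, uint64_t va1, uint64_t va2,
                             uint32_t fence1, uint32_t fence2,
                             uint32_t fence3, uint32_t fence4)
{
        /* Binds are ordered only within their queue, so if fence3 signals
         * first, the ccs0-queue bind can complete even though the rcs0-queue
         * bind is still waiting on fence1. */
        vm_bind_on_queue(vm, BIND_QUEUE_RCS0, bo1, va1, fence1, fence2);
        vm_bind_on_queue(vm, BIND_QUEUE_CCS0, bo2, va2, fence3, fence4);
}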
I prefer adding this later as an extension, based on whether it really helps with the implementation.
Daniel, any thoughts?
Niranjana
On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura < niranjana.vishwanathapura@intel.com> wrote:
I can tell you right now that having everything on a single in-order queue will not get us the perf we want. What Vulkan really wants is one of two things:
1. No implicit ordering of VM_BIND ops. They just happen in whatever order their dependencies are resolved and we ensure ordering ourselves by having a syncobj in the VkQueue.
2. The ability to create multiple VM_BIND queues. We need at least 2 but I don't see why there needs to be a limit besides the limits the i915 API already has on the number of engines. Vulkan could expose multiple sparse binding queues to the client if it's not arbitrarily limited.
Why? Because Vulkan has two basic kinds of bind operations and we don't want any dependencies between them:
1. Immediate. These happen right after BO creation or maybe as part of vkBindImageMemory() or vkBindBufferMemory(). These don't happen on a queue and we don't want them serialized with anything. To synchronize with submit, we'll have a syncobj in the VkDevice which is signaled by all immediate bind operations and make submits wait on it.
2. Queued (sparse): These happen on a VkQueue which may be the same as a render/compute queue or may be its own queue. It's up to us what we want to advertise. From the Vulkan API PoV, this is like any other queue. Operations on it wait on and signal semaphores. If we have a VM_BIND engine, we'd provide syncobjs to wait and signal just like we do in execbuf().
The important thing is that we don't want one type of operation to block on the other. If immediate binds are blocking on sparse binds, it's going to cause over-synchronization issues.
In terms of the internal implementation, I know that there's going to be a lock on the VM and that we can't actually do these things in parallel. That's fine. Once the dma_fences have signaled and we're unblocked to do the bind operation, I don't care if there's a bit of synchronization due to locking. That's expected. What we can't afford to have is an immediate bind operation suddenly blocking on a sparse operation which is blocked on a compute job that's going to run for another 5ms.
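A minimal sketch of the "immediate" path from point 1 above, assuming a hypothetical wrapper around the proposed vm_bind out-fence plus the existing execbuf timeline-fence extension (all helper names are illustrative, not Mesa code):

#include <stdint.h>

extern void vm_bind_signal_timeline(int vm, int bo, uint64_t va,
                                    uint32_t syncobj, uint64_t point);
extern void execbuf_wait_timeline(int ctx, int batch_bo,
                                  uint32_t syncobj, uint64_t point);

static uint32_t dev_bind_syncobj;  /* one timeline syncobj per VkDevice */
static uint64_t dev_bind_point;    /* last point signaled by an immediate bind */

/* vkBindImageMemory()/vkBindBufferMemory(): bind right away and signal the
 * device timeline at a new point; nothing is waited on here, so this never
 * stalls behind sparse-queue work or a long compute job. */
static void immediate_bind(int vm, int bo, uint64_t va)
{
        vm_bind_signal_timeline(vm, bo, va, dev_bind_syncobj, ++dev_bind_point);
}

/* vkQueueSubmit(): make the GPU job wait for every immediate bind issued so
 * far by waiting on the latest device timeline point. */
static void submit_batch(int ctx, int batch_bo)
{
        execbuf_wait_timeline(ctx, batch_bo, dev_bind_syncobj, dev_bind_point);
}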
For reference, Windows solves this by allowing arbitrarily many paging queues (what they call a VM_BIND engine/queue). That design works pretty well and solves the problems in question. Again, we could just make everything out-of-order and require using syncobjs to order things as userspace wants. That'd be fine too.
One more note while I'm here: danvet said something on IRC about VM_BIND queues waiting for syncobjs to materialize. We don't really want/need this. We already have all the machinery in userspace to handle wait-before-signal and waiting for syncobj fences to materialize and that machinery is on by default. It would actually take MORE work in Mesa to turn it off and take advantage of the kernel being able to wait for syncobjs to materialize. Also, getting that right is ridiculously hard and I really don't want to get it wrong in kernel space. When we do memory fences, wait-before-signal will be a thing. We don't need to try and make it a thing for syncobj.
--Jason
On 02/06/2022 23:35, Jason Ekstrand wrote:
On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote: >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote: >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an >> > +async worker. The binding and unbinding will work like a special GPU engine. >> > +The binding and unbinding operations are serialized and will wait on specified >> > +input fences before the operation and will signal the output fences upon the >> > +completion of the operation. Due to serialization, completion of an operation >> > +will also indicate that all previous operations are also complete. >> >> I guess we should avoid saying "will immediately start binding/unbinding" if >> there are fences involved. >> >> And the fact that it's happening in an async worker seem to imply it's not >> immediate. >> Ok, will fix. This was added because in earlier design binding was deferred until next execbuff. But now it is non-deferred (immediate in that sense). But yah, this is confusing and will fix it. >> >> I have a question on the behavior of the bind operation when no input fence >> is provided. Let say I do : >> >> VM_BIND (out_fence=fence1) >> >> VM_BIND (out_fence=fence2) >> >> VM_BIND (out_fence=fence3) >> >> >> In what order are the fences going to be signaled? >> >> In the order of VM_BIND ioctls? Or out of order? >> >> Because you wrote "serialized I assume it's : in order >> Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will use the same queue and hence are ordered. >> >> One thing I didn't realize is that because we only get one "VM_BIND" engine, >> there is a disconnect from the Vulkan specification. >> >> In Vulkan VM_BIND operations are serialized but per engine. >> >> So you could have something like this : >> >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) >> >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) >> >> >> fence1 is not signaled >> >> fence3 is signaled >> >> So the second VM_BIND will proceed before the first VM_BIND. >> >> >> I guess we can deal with that scenario in userspace by doing the wait >> ourselves in one thread per engines. >> >> But then it makes the VM_BIND input fences useless. >> >> >> Daniel : what do you think? Should be rework this or just deal with wait >> fences in userspace? >> > >My opinion is rework this but make the ordering via an engine param optional. > >e.g. A VM can be configured so all binds are ordered within the VM > >e.g. A VM can be configured so all binds accept an engine argument (in >the case of the i915 likely this is a gem context handle) and binds >ordered with respect to that engine. > >This gives UMDs options as the later likely consumes more KMD resources >so if a different UMD can live with binds being ordered within the VM >they can use a mode consuming less resources. > I think we need to be careful here if we are looking for some out of (submission) order completion of vm_bind/unbind. In-order completion means, in a batch of binds and unbinds to be completed in-order, user only needs to specify in-fence for the first bind/unbind call and the our-fence for the last bind/unbind call. Also, the VA released by an unbind call can be re-used by any subsequent bind call in that in-order batch. 
These things will break if binding/unbinding were allowed to go out of (submission) order, and the user would need to be extra careful not to run into premature triggering of the out-fence, binds failing because the VA is still in use, etc. Also, VM_BIND binds the provided mapping on the specified address space (VM), so the uapi is not engine/context specific. We can however add a 'queue' to the uapi, which can be one of the pre-defined queues:
I915_VM_BIND_QUEUE_0
I915_VM_BIND_QUEUE_1
...
I915_VM_BIND_QUEUE_(N-1)
The KMD will spawn an async work queue for each queue, which will only bind the mappings on that queue in the order of submission. The user can assign a queue per engine or anything like that. But again, the user needs to be careful not to deadlock these queues with a circular dependency of fences. I prefer adding this later as an extension, based on whether it really helps with the implementation.
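For illustration, the pre-defined queue idea above could look roughly like the sketch below; the names are hypothetical and only show the shape of the proposal, not actual uapi:

/*
 * Hypothetical sketch of the pre-defined VM_BIND queue indices proposed
 * above.  Binds/unbinds submitted with the same queue index would be
 * executed by one ordered async worker in submission order; different
 * queue indices would be independent of each other.
 */
#define I915_VM_BIND_QUEUE_0		0
#define I915_VM_BIND_QUEUE_1		1
/* ...					*/
/* I915_VM_BIND_QUEUE_(N-1)		*/

Each vm_bind/unbind ioctl would then carry one of these indices in a __u32 queue field (see the struct sketch later in this thread).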
I can tell you right now that having everything on a single in-order queue will not get us the perf we want. What vulkan really wants is one of two things:
1. No implicit ordering of VM_BIND ops. They just happen in whatever their dependencies are resolved and we ensure ordering ourselves by having a syncobj in the VkQueue.
2. The ability to create multiple VM_BIND queues. We need at least 2 but I don't see why there needs to be a limit besides the limits the i915 API already has on the number of engines. Vulkan could expose multiple sparse binding queues to the client if it's not arbitrarily limited.
Why? Because Vulkan has two basic kind of bind operations and we don't want any dependencies between them:
1. Immediate. These happen right after BO creation or maybe as part of vkBindImageMemory() or VkBindBufferMemory(). These don't happen on a queue and we don't want them serialized with anything. To synchronize with submit, we'll have a syncobj in the VkDevice which is signaled by all immediate bind operations and make submits wait on it.
2. Queued (sparse): These happen on a VkQueue which may be the same as a render/compute queue or may be its own queue. It's up to us what we want to advertise. From the Vulkan API PoV, this is like any other queue. Operations on it wait on and signal semaphores. If we have a VM_BIND engine, we'd provide syncobjs to wait and signal just like we do in execbuf().
The important thing is that we don't want one type of operation to block on the other. If immediate binds are blocking on sparse binds, it's going to cause over-synchronization issues.
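As a rough userspace-side illustration of point 1 above (a sketch only: drmSyncobjCreate() is existing libdrm API, everything else here is a made-up placeholder, since the final vm_bind ioctl shape is still being discussed):

#include <stdint.h>
#include <xf86drm.h>

/*
 * Sketch: one device-wide syncobj that every "immediate" bind signals
 * and every execbuf waits on, so GPU submissions never race ahead of
 * the bind operations they depend on.
 */
struct device_sketch {
	int fd;				/* DRM device fd */
	uint32_t bind_ready_syncobj;	/* signaled by all immediate binds */
};

static int init_bind_sync(struct device_sketch *dev)
{
	return drmSyncobjCreate(dev->fd, 0, &dev->bind_ready_syncobj);
}

/*
 * Each immediate vm_bind would pass bind_ready_syncobj as its out-fence,
 * and each execbuf would list the same syncobj as a wait-fence.
 */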
In terms of the internal implementation, I know that there's going to be a lock on the VM and that we can't actually do these things in parallel. That's fine. Once the dma_fences have signaled and we're unblocked to do the bind operation, I don't care if there's a bit of synchronization due to locking. That's expected. What we can't afford to have is an immediate bind operation suddenly blocking on a sparse operation which is blocked on a compute job that's going to run for another 5ms.
For reference, Windows solves this by allowing arbitrarily many paging queues (what they call a VM_BIND engine/queue). That design works pretty well and solves the problems in question. Again, we could just make everything out-of-order and require using syncobjs to order things as userspace wants. That'd be fine too.
One more note while I'm here: danvet said something on IRC about VM_BIND queues waiting for syncobjs to materialize. We don't really want/need this. We already have all the machinery in userspace to handle wait-before-signal and waiting for syncobj fences to materialize and that machinery is on by default. It would actually take MORE work in Mesa to turn it off and take advantage of the kernel being able to wait for syncobjs to materialize. Also, getting that right is ridiculously hard and I really don't want to get it wrong in kernel space. When we do memory fences, wait-before-signal will be a thing. We don't need to try and make it a thing for syncobj.
--Jason
Thanks Jason,
I missed the bit in the Vulkan spec that we're allowed to have a sparse queue that does not implement either graphics or compute operations :
"While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT support in queue families that also include
graphics and compute support, other implementations may only expose a VK_QUEUE_SPARSE_BINDING_BIT-only queue
family."
So it can all be a vm_bind engine that just does bind/unbind operations.
But yes we need another engine for the immediate/non-sparse operations.
-Lionel
On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
On 02/06/2022 23:35, Jason Ekstrand wrote:
On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> wrote: On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote: >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote: >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an >> > +async worker. The binding and unbinding will work like a special GPU engine. >> > +The binding and unbinding operations are serialized and will wait on specified >> > +input fences before the operation and will signal the output fences upon the >> > +completion of the operation. Due to serialization, completion of an operation >> > +will also indicate that all previous operations are also complete. >> >> I guess we should avoid saying "will immediately start binding/unbinding" if >> there are fences involved. >> >> And the fact that it's happening in an async worker seem to imply it's not >> immediate. >> Ok, will fix. This was added because in earlier design binding was deferred until next execbuff. But now it is non-deferred (immediate in that sense). But yah, this is confusing and will fix it. >> >> I have a question on the behavior of the bind operation when no input fence >> is provided. Let say I do : >> >> VM_BIND (out_fence=fence1) >> >> VM_BIND (out_fence=fence2) >> >> VM_BIND (out_fence=fence3) >> >> >> In what order are the fences going to be signaled? >> >> In the order of VM_BIND ioctls? Or out of order? >> >> Because you wrote "serialized I assume it's : in order >> Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will use the same queue and hence are ordered. >> >> One thing I didn't realize is that because we only get one "VM_BIND" engine, >> there is a disconnect from the Vulkan specification. >> >> In Vulkan VM_BIND operations are serialized but per engine. >> >> So you could have something like this : >> >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) >> >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) >> >> >> fence1 is not signaled >> >> fence3 is signaled >> >> So the second VM_BIND will proceed before the first VM_BIND. >> >> >> I guess we can deal with that scenario in userspace by doing the wait >> ourselves in one thread per engines. >> >> But then it makes the VM_BIND input fences useless. >> >> >> Daniel : what do you think? Should be rework this or just deal with wait >> fences in userspace? >> > >My opinion is rework this but make the ordering via an engine param optional. > >e.g. A VM can be configured so all binds are ordered within the VM > >e.g. A VM can be configured so all binds accept an engine argument (in >the case of the i915 likely this is a gem context handle) and binds >ordered with respect to that engine. > >This gives UMDs options as the later likely consumes more KMD resources >so if a different UMD can live with binds being ordered within the VM >they can use a mode consuming less resources. > I think we need to be careful here if we are looking for some out of (submission) order completion of vm_bind/unbind. In-order completion means, in a batch of binds and unbinds to be completed in-order, user only needs to specify in-fence for the first bind/unbind call and the our-fence for the last bind/unbind call. Also, the VA released by an unbind call can be re-used by any subsequent bind call in that in-order batch. 
These things will break if binding/unbinding were to be allowed to go out of order (of submission) and user need to be extra careful not to run into pre-mature triggereing of out-fence and bind failing as VA is still in use etc. Also, VM_BIND binds the provided mapping on the specified address space (VM). So, the uapi is not engine/context specific. We can however add a 'queue' to the uapi which can be one from the pre-defined queues, I915_VM_BIND_QUEUE_0 I915_VM_BIND_QUEUE_1 ... I915_VM_BIND_QUEUE_(N-1) KMD will spawn an async work queue for each queue which will only bind the mappings on that queue in the order of submission. User can assign the queue to per engine or anything like that. But again here, user need to be careful and not deadlock these queues with circular dependency of fences. I prefer adding this later an as extension based on whether it is really helping with the implementation. I can tell you right now that having everything on a single in-order queue will not get us the perf we want. What vulkan really wants is one of two things: 1. No implicit ordering of VM_BIND ops. They just happen in whatever their dependencies are resolved and we ensure ordering ourselves by having a syncobj in the VkQueue. 2. The ability to create multiple VM_BIND queues. We need at least 2 but I don't see why there needs to be a limit besides the limits the i915 API already has on the number of engines. Vulkan could expose multiple sparse binding queues to the client if it's not arbitrarily limited.
Thanks Jason, Lionel.
Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.
I am trying to see how many queues we need and don't want it to be arbitrarily large and unduly blow up memory usage and complexity in the i915 driver.
Why? Because Vulkan has two basic kind of bind operations and we don't want any dependencies between them: 1. Immediate. These happen right after BO creation or maybe as part of vkBindImageMemory() or VkBindBufferMemory(). These don't happen on a queue and we don't want them serialized with anything. To synchronize with submit, we'll have a syncobj in the VkDevice which is signaled by all immediate bind operations and make submits wait on it. 2. Queued (sparse): These happen on a VkQueue which may be the same as a render/compute queue or may be its own queue. It's up to us what we want to advertise. From the Vulkan API PoV, this is like any other queue. Operations on it wait on and signal semaphores. If we have a VM_BIND engine, we'd provide syncobjs to wait and signal just like we do in execbuf(). The important thing is that we don't want one type of operation to block on the other. If immediate binds are blocking on sparse binds, it's going to cause over-synchronization issues. In terms of the internal implementation, I know that there's going to be a lock on the VM and that we can't actually do these things in parallel. That's fine. Once the dma_fences have signaled and we're
That's correct. It is like a single VM_BIND engine with multiple queues feeding into it.
unblocked to do the bind operation, I don't care if there's a bit of synchronization due to locking. That's expected. What we can't afford to have is an immediate bind operation suddenly blocking on a sparse operation which is blocked on a compute job that's going to run for another 5ms.
As the VM_BIND queue is per VM, a VM_BIND on one VM doesn't block VM_BINDs on other VMs. I am not sure about the use cases here, but just wanted to clarify.
Niranjana
For reference, Windows solves this by allowing arbitrarily many paging queues (what they call a VM_BIND engine/queue). That design works pretty well and solves the problems in question. Again, we could just make everything out-of-order and require using syncobjs to order things as userspace wants. That'd be fine too. One more note while I'm here: danvet said something on IRC about VM_BIND queues waiting for syncobjs to materialize. We don't really want/need this. We already have all the machinery in userspace to handle wait-before-signal and waiting for syncobj fences to materialize and that machinery is on by default. It would actually take MORE work in Mesa to turn it off and take advantage of the kernel being able to wait for syncobjs to materialize. Also, getting that right is ridiculously hard and I really don't want to get it wrong in kernel space. When we do memory fences, wait-before-signal will be a thing. We don't need to try and make it a thing for syncobj. --Jason
Thanks Jason,
I missed the bit in the Vulkan spec that we're allowed to have a sparse queue that does not implement either graphics or compute operations :
"While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT support in queue families that also include graphics and compute support, other implementations may only expose a VK_QUEUE_SPARSE_BINDING_BIT-only queue family."
So it can all be all a vm_bind engine that just does bind/unbind operations.
But yes we need another engine for the immediate/non-sparse operations.
-Lionel
On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> wrote:
Thanks Jason, Lionel.
Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.
There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2. I think someone had an extended version that allowed more but I ripped it out because no one was using it. Of course, execbuffer3 might not have that problem at all.
I am trying to see how many queues we need and don't want it to be arbitrarily large and unduly blow up memory usage and complexity in the i915 driver.
I expect a Vulkan driver to use at most 2 in the vast majority of cases. I could imagine a client wanting to create more than 1 sparse queue in which case, it'll be N+1 but that's unlikely. As far as complexity goes, once you allow two, I don't think the complexity is going up by allowing N. As for memory usage, creating more queues means more memory. That's a trade-off that userspace can make. Again, the expected number here is 1 or 2 in the vast majority of cases so I don't think you need to worry.
Thats correct. It is like a single VM_BIND engine with multiple queues feeding to it.
Right. As long as the queues themselves are independent and can block on dma_fences without holding up other queues, I think we're fine.
As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND on other VMs. I am not sure about usecases here, but just wanted to clarify.
Yes, that's what I would expect.
--Jason
On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: > On 02/06/2022 23:35, Jason Ekstrand wrote: > > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura > <niranjana.vishwanathapura@intel.com> wrote: > > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote: > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote: > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: > >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding > the mapping in an > >> > +async worker. The binding and unbinding will work like a special > GPU engine. > >> > +The binding and unbinding operations are serialized and will > wait on specified > >> > +input fences before the operation and will signal the output > fences upon the > >> > +completion of the operation. Due to serialization, completion of > an operation > >> > +will also indicate that all previous operations are also > complete. > >> > >> I guess we should avoid saying "will immediately start > binding/unbinding" if > >> there are fences involved. > >> > >> And the fact that it's happening in an async worker seem to imply > it's not > >> immediate. > >> > > Ok, will fix. > This was added because in earlier design binding was deferred until > next execbuff. > But now it is non-deferred (immediate in that sense). But yah, this is > confusing > and will fix it. > > >> > >> I have a question on the behavior of the bind operation when no > input fence > >> is provided. Let say I do : > >> > >> VM_BIND (out_fence=fence1) > >> > >> VM_BIND (out_fence=fence2) > >> > >> VM_BIND (out_fence=fence3) > >> > >> > >> In what order are the fences going to be signaled? > >> > >> In the order of VM_BIND ioctls? Or out of order? > >> > >> Because you wrote "serialized I assume it's : in order > >> > > Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind > will use > the same queue and hence are ordered. > > >> > >> One thing I didn't realize is that because we only get one > "VM_BIND" engine, > >> there is a disconnect from the Vulkan specification. > >> > >> In Vulkan VM_BIND operations are serialized but per engine. > >> > >> So you could have something like this : > >> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) > >> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) > >> > >> > >> fence1 is not signaled > >> > >> fence3 is signaled > >> > >> So the second VM_BIND will proceed before the first VM_BIND. > >> > >> > >> I guess we can deal with that scenario in userspace by doing the > wait > >> ourselves in one thread per engines. > >> > >> But then it makes the VM_BIND input fences useless. > >> > >> > >> Daniel : what do you think? Should be rework this or just deal with > wait > >> fences in userspace? > >> > > > >My opinion is rework this but make the ordering via an engine param > optional. > > > >e.g. A VM can be configured so all binds are ordered within the VM > > > >e.g. A VM can be configured so all binds accept an engine argument > (in > >the case of the i915 likely this is a gem context handle) and binds > >ordered with respect to that engine. > > > >This gives UMDs options as the later likely consumes more KMD > resources > >so if a different UMD can live with binds being ordered within the VM > >they can use a mode consuming less resources. > > > > I think we need to be careful here if we are looking for some out of > (submission) order completion of vm_bind/unbind. 
> In-order completion means, in a batch of binds and unbinds to be > completed in-order, user only needs to specify in-fence for the > first bind/unbind call and the our-fence for the last bind/unbind > call. Also, the VA released by an unbind call can be re-used by > any subsequent bind call in that in-order batch. > > These things will break if binding/unbinding were to be allowed to > go out of order (of submission) and user need to be extra careful > not to run into pre-mature triggereing of out-fence and bind failing > as VA is still in use etc. > > Also, VM_BIND binds the provided mapping on the specified address > space > (VM). So, the uapi is not engine/context specific. > > We can however add a 'queue' to the uapi which can be one from the > pre-defined queues, > I915_VM_BIND_QUEUE_0 > I915_VM_BIND_QUEUE_1 > ... > I915_VM_BIND_QUEUE_(N-1) > > KMD will spawn an async work queue for each queue which will only > bind the mappings on that queue in the order of submission. > User can assign the queue to per engine or anything like that. > > But again here, user need to be careful and not deadlock these > queues with circular dependency of fences. > > I prefer adding this later an as extension based on whether it > is really helping with the implementation. > > I can tell you right now that having everything on a single in-order > queue will not get us the perf we want. What vulkan really wants is one > of two things: > 1. No implicit ordering of VM_BIND ops. They just happen in whatever > their dependencies are resolved and we ensure ordering ourselves by > having a syncobj in the VkQueue. > 2. The ability to create multiple VM_BIND queues. We need at least 2 > but I don't see why there needs to be a limit besides the limits the > i915 API already has on the number of engines. Vulkan could expose > multiple sparse binding queues to the client if it's not arbitrarily > limited. Thanks Jason, Lionel. Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.
There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2. I think someone had an extended version that allowed more but I ripped it out because no one was using it. Of course, execbuffer3 might not have that problem at all.
Thanks Jason. Ok, I am not sure which exec flag that is, but yeah, execbuffer3 probably will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE and somehow export it to the user (I am thinking of embedding it in I915_PARAM_HAS_VM_BIND: bits[0]->HAS_VM_BIND, bits[1:3]->'n', meaning 2^n queues).
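A sketch of how userspace might query and decode such an encoding; the bit layout is only the proposal above and the param value is a placeholder, not final uapi:

#include <sys/ioctl.h>
#include <drm/i915_drm.h>

#ifndef I915_PARAM_HAS_VM_BIND
#define I915_PARAM_HAS_VM_BIND 0	/* placeholder only; real value comes from the final uapi */
#endif

/*
 * Returns the number of VM_BIND queues, or 0 if VM_BIND is unsupported.
 * Assumes the proposed encoding: bit 0 = has_vm_bind, bits 1..3 = n,
 * with 2^n queues.
 */
static unsigned int query_vm_bind_queues(int fd)
{
	int value = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_VM_BIND,
		.value = &value,
	};

	if (ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp))
		return 0;
	if (!(value & 0x1))
		return 0;
	return 1u << ((value >> 1) & 0x7);
}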
I am trying to see how many queues we need and don't want it to be arbitrarily large and unduely blow up memory usage and complexity in i915 driver.
I expect a Vulkan driver to use at most 2 in the vast majority of cases. I could imagine a client wanting to create more than 1 sparse queue in which case, it'll be N+1 but that's unlikely. As far as complexity goes, once you allow two, I don't think the complexity is going up by allowing N. As for memory usage, creating more queues means more memory. That's a trade-off that userspace can make. Again, the expected number here is 1 or 2 in the vast majority of cases so I don't think you need to worry.
Ok, will start with n=3, meaning 8 queues. That would require us to create 8 workqueues. We can change 'n' later if required.
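Roughly, on the kernel side "one ordered async workqueue per bind queue" could be sketched as below (illustrative only, with made-up names; not the actual i915 implementation):

#include <linux/errno.h>
#include <linux/workqueue.h>

#define VM_BIND_NUM_QUEUES	8	/* n=3 => 2^3 queues, per the discussion above */

/*
 * Sketch: one ordered workqueue per VM_BIND queue, so binds/unbinds on
 * the same queue complete strictly in submission order while different
 * queues can make progress independently of each other.
 */
struct sketch_vm_bind_queues {
	struct workqueue_struct *wq[VM_BIND_NUM_QUEUES];
};

static int sketch_vm_bind_queues_init(struct sketch_vm_bind_queues *q)
{
	int i;

	for (i = 0; i < VM_BIND_NUM_QUEUES; i++) {
		q->wq[i] = alloc_ordered_workqueue("vm_bind_q%d", 0, i);
		if (!q->wq[i]) {
			while (i--)
				destroy_workqueue(q->wq[i]);
			return -ENOMEM;
		}
	}
	return 0;
}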
Niranjana
> Why? Because Vulkan has two basic kind of bind operations and we don't > want any dependencies between them: > 1. Immediate. These happen right after BO creation or maybe as part of > vkBindImageMemory() or VkBindBufferMemory(). These don't happen on a > queue and we don't want them serialized with anything. To synchronize > with submit, we'll have a syncobj in the VkDevice which is signaled by > all immediate bind operations and make submits wait on it. > 2. Queued (sparse): These happen on a VkQueue which may be the same as > a render/compute queue or may be its own queue. It's up to us what we > want to advertise. From the Vulkan API PoV, this is like any other > queue. Operations on it wait on and signal semaphores. If we have a > VM_BIND engine, we'd provide syncobjs to wait and signal just like we do > in execbuf(). > The important thing is that we don't want one type of operation to block > on the other. If immediate binds are blocking on sparse binds, it's > going to cause over-synchronization issues. > In terms of the internal implementation, I know that there's going to be > a lock on the VM and that we can't actually do these things in > parallel. That's fine. Once the dma_fences have signaled and we're Thats correct. It is like a single VM_BIND engine with multiple queues feeding to it.
Right. As long as the queues themselves are independent and can block on dma_fences without holding up other queues, I think we're fine.
> unblocked to do the bind operation, I don't care if there's a bit of > synchronization due to locking. That's expected. What we can't afford > to have is an immediate bind operation suddenly blocking on a sparse > operation which is blocked on a compute job that's going to run for > another 5ms. As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND on other VMs. I am not sure about usecases here, but just wanted to clarify.
Yes, that's what I would expect. --Jason
Niranjana > For reference, Windows solves this by allowing arbitrarily many paging > queues (what they call a VM_BIND engine/queue). That design works > pretty well and solves the problems in question. Again, we could just > make everything out-of-order and require using syncobjs to order things > as userspace wants. That'd be fine too. > One more note while I'm here: danvet said something on IRC about VM_BIND > queues waiting for syncobjs to materialize. We don't really want/need > this. We already have all the machinery in userspace to handle > wait-before-signal and waiting for syncobj fences to materialize and > that machinery is on by default. It would actually take MORE work in > Mesa to turn it off and take advantage of the kernel being able to wait > for syncobjs to materialize. Also, getting that right is ridiculously > hard and I really don't want to get it wrong in kernel space. When we > do memory fences, wait-before-signal will be a thing. We don't need to > try and make it a thing for syncobj. > --Jason > > Thanks Jason, > > I missed the bit in the Vulkan spec that we're allowed to have a sparse > queue that does not implement either graphics or compute operations : > > "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT > support in queue families that also include > > graphics and compute support, other implementations may only expose a > VK_QUEUE_SPARSE_BINDING_BIT-only queue > > family." > > So it can all be all a vm_bind engine that just does bind/unbind > operations. > > But yes we need another engine for the immediate/non-sparse operations. > > -Lionel > > > > Daniel, any thoughts? > > Niranjana > > >Matt > > > >> > >> Sorry I noticed this late. > >> > >> > >> -Lionel > >> > >>
On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
Thanks Jason, Lionel.
Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.
There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2. I think someone had an extended version that allowed more but I ripped it out because no one was using it. Of course, execbuffer3 might not have that problem at all.
Thanks Jason. Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE and somehow export it to user (I am thinking of embedding it in I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n queues.
Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f), which execbuf3 will also have. So, we can simply define in the vm_bind/unbind structures:
#define I915_VM_BIND_MAX_QUEUE 64
__u32 queue;
I think that will keep things simple.
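For reference, the existing execbuf2 limit being discussed, together with a sketch of where such a queue field might sit in a bind structure (the struct below is illustrative only, not the RFC's final uapi):

#include <linux/types.h>

/*
 * Existing execbuf2 limit (include/uapi/drm/i915_drm.h):
 *	#define I915_EXEC_RING_MASK (0x3f)
 * i.e. at most 64 engines are addressable through the execbuf2 flags.
 */

/* Proposed above: carry the same limit over as the VM_BIND queue limit. */
#define I915_VM_BIND_MAX_QUEUE	64

/* Hypothetical placement of the queue index; field names and layout are
 * made up for illustration. */
struct sketch_i915_gem_vm_bind {
	__u32 vm_id;		/* target address space (VM) */
	__u32 queue;		/* 0 .. I915_VM_BIND_MAX_QUEUE - 1 */
	__u32 handle;		/* GEM BO handle */
	__u32 pad;
	__u64 start;		/* GPU virtual address of the mapping */
	__u64 offset;		/* offset into the BO */
	__u64 length;		/* length of the mapping */
	__u64 flags;
	__u64 extensions;	/* in/out fence extensions chained here */
};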
Niranjana
On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: > On 02/06/2022 23:35, Jason Ekstrand wrote: > > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura > niranjana.vishwanathapura@intel.com wrote: > > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote: > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote: > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: > >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding > the mapping in an > >> > +async worker. The binding and unbinding will work like a special > GPU engine. > >> > +The binding and unbinding operations are serialized and will > wait on specified > >> > +input fences before the operation and will signal the output > fences upon the > >> > +completion of the operation. Due to serialization, completion of > an operation > >> > +will also indicate that all previous operations are also > complete. > >> > >> I guess we should avoid saying "will immediately start > binding/unbinding" if > >> there are fences involved. > >> > >> And the fact that it's happening in an async worker seem to imply > it's not > >> immediate. > >> > > Ok, will fix. > This was added because in earlier design binding was deferred until > next execbuff. > But now it is non-deferred (immediate in that sense). But yah, this is > confusing > and will fix it. > > >> > >> I have a question on the behavior of the bind operation when no > input fence > >> is provided. Let say I do : > >> > >> VM_BIND (out_fence=fence1) > >> > >> VM_BIND (out_fence=fence2) > >> > >> VM_BIND (out_fence=fence3) > >> > >> > >> In what order are the fences going to be signaled? > >> > >> In the order of VM_BIND ioctls? Or out of order? > >> > >> Because you wrote "serialized I assume it's : in order > >> > > Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind > will use > the same queue and hence are ordered. > > >> > >> One thing I didn't realize is that because we only get one > "VM_BIND" engine, > >> there is a disconnect from the Vulkan specification. > >> > >> In Vulkan VM_BIND operations are serialized but per engine. > >> > >> So you could have something like this : > >> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) > >> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) > >> > >> > >> fence1 is not signaled > >> > >> fence3 is signaled > >> > >> So the second VM_BIND will proceed before the first VM_BIND. > >> > >> > >> I guess we can deal with that scenario in userspace by doing the > wait > >> ourselves in one thread per engines. > >> > >> But then it makes the VM_BIND input fences useless. > >> > >> > >> Daniel : what do you think? Should be rework this or just deal with > wait > >> fences in userspace? > >> > > > >My opinion is rework this but make the ordering via an engine param > optional. > > > >e.g. A VM can be configured so all binds are ordered within the VM > > > >e.g. A VM can be configured so all binds accept an engine argument > (in > >the case of the i915 likely this is a gem context handle) and binds > >ordered with respect to that engine. > > > >This gives UMDs options as the later likely consumes more KMD > resources > >so if a different UMD can live with binds being ordered within the VM > >they can use a mode consuming less resources. > > > > I think we need to be careful here if we are looking for some out of > (submission) order completion of vm_bind/unbind. 
> In-order completion means, in a batch of binds and unbinds to be > completed in-order, user only needs to specify in-fence for the > first bind/unbind call and the our-fence for the last bind/unbind > call. Also, the VA released by an unbind call can be re-used by > any subsequent bind call in that in-order batch. > > These things will break if binding/unbinding were to be allowed to > go out of order (of submission) and user need to be extra careful > not to run into pre-mature triggereing of out-fence and bind failing > as VA is still in use etc. > > Also, VM_BIND binds the provided mapping on the specified address > space > (VM). So, the uapi is not engine/context specific. > > We can however add a 'queue' to the uapi which can be one from the > pre-defined queues, > I915_VM_BIND_QUEUE_0 > I915_VM_BIND_QUEUE_1 > ... > I915_VM_BIND_QUEUE_(N-1) > > KMD will spawn an async work queue for each queue which will only > bind the mappings on that queue in the order of submission. > User can assign the queue to per engine or anything like that. > > But again here, user need to be careful and not deadlock these > queues with circular dependency of fences. > > I prefer adding this later an as extension based on whether it > is really helping with the implementation. > > I can tell you right now that having everything on a single in-order > queue will not get us the perf we want. What vulkan really wants is one > of two things: > 1. No implicit ordering of VM_BIND ops. They just happen in whatever > their dependencies are resolved and we ensure ordering ourselves by > having a syncobj in the VkQueue. > 2. The ability to create multiple VM_BIND queues. We need at least 2 > but I don't see why there needs to be a limit besides the limits the > i915 API already has on the number of engines. Vulkan could expose > multiple sparse binding queues to the client if it's not arbitrarily > limited.
Thanks Jason, Lionel.
Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.
There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2. I think someone had an extended version that allowed more but I ripped it out because no one was using it. Of course, execbuffer3 might not have that problem at all.
Thanks Jason. Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE and somehow export it to user (I am thinking of embedding it in I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n queues.
Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3 will also have. So, we can simply define in vm_bind/unbind structures,
#define I915_VM_BIND_MAX_QUEUE 64 __u32 queue;
I think that will keep things simple.
Hmmm? What does the execbuf2 limit have to do with how many engines the hardware can have? I suggest not doing that.
The change which added this to context creation:
if (set.num_engines > I915_EXEC_RING_MASK + 1)
        return -EINVAL;
needs to be undone, so that users can create engine maps with all hardware engines and execbuf3 can access them all.
Regards,
Tvrtko
Niranjana
I am trying to see how many queues we need and don't want it to be arbitrarily large and unduely blow up memory usage and complexity in i915 driver.
I expect a Vulkan driver to use at most 2 in the vast majority of cases. I could imagine a client wanting to create more than 1 sparse queue in which case, it'll be N+1 but that's unlikely. As far as complexity goes, once you allow two, I don't think the complexity is going up by allowing N. As for memory usage, creating more queues means more memory. That's a trade-off that userspace can make. Again, the expected number here is 1 or 2 in the vast majority of cases so I don't think you need to worry.
Ok, will start with n=3 meaning 8 queues. That would require us create 8 workqueues. We can change 'n' later if required.
Niranjana
> Why? Because Vulkan has two basic kind of bind operations and we don't > want any dependencies between them: > 1. Immediate. These happen right after BO creation or maybe as part of > vkBindImageMemory() or VkBindBufferMemory(). These don't happen on a > queue and we don't want them serialized with anything. To synchronize > with submit, we'll have a syncobj in the VkDevice which is signaled by > all immediate bind operations and make submits wait on it. > 2. Queued (sparse): These happen on a VkQueue which may be the same as > a render/compute queue or may be its own queue. It's up to us what we > want to advertise. From the Vulkan API PoV, this is like any other > queue. Operations on it wait on and signal semaphores. If we have a > VM_BIND engine, we'd provide syncobjs to wait and signal just like we do > in execbuf(). > The important thing is that we don't want one type of operation to block > on the other. If immediate binds are blocking on sparse binds, it's > going to cause over-synchronization issues. > In terms of the internal implementation, I know that there's going to be > a lock on the VM and that we can't actually do these things in > parallel. That's fine. Once the dma_fences have signaled and we're
Thats correct. It is like a single VM_BIND engine with multiple queues feeding to it.
Right. As long as the queues themselves are independent and can block on dma_fences without holding up other queues, I think we're fine.
> unblocked to do the bind operation, I don't care if there's a bit of > synchronization due to locking. That's expected. What we can't afford > to have is an immediate bind operation suddenly blocking on a sparse > operation which is blocked on a compute job that's going to run for > another 5ms.
As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND on other VMs. I am not sure about usecases here, but just wanted to clarify.
Yes, that's what I would expect. --Jason
Niranjana
> For reference, Windows solves this by allowing arbitrarily many paging > queues (what they call a VM_BIND engine/queue). That design works > pretty well and solves the problems in question. Again, we could just > make everything out-of-order and require using syncobjs to order things > as userspace wants. That'd be fine too. > One more note while I'm here: danvet said something on IRC about VM_BIND > queues waiting for syncobjs to materialize. We don't really want/need > this. We already have all the machinery in userspace to handle > wait-before-signal and waiting for syncobj fences to materialize and > that machinery is on by default. It would actually take MORE work in > Mesa to turn it off and take advantage of the kernel being able to wait > for syncobjs to materialize. Also, getting that right is ridiculously > hard and I really don't want to get it wrong in kernel space. When we > do memory fences, wait-before-signal will be a thing. We don't need to > try and make it a thing for syncobj. > --Jason > > Thanks Jason, > > I missed the bit in the Vulkan spec that we're allowed to have a sparse > queue that does not implement either graphics or compute operations : > > "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT > support in queue families that also include > > graphics and compute support, other implementations may only expose a > VK_QUEUE_SPARSE_BINDING_BIT-only queue > > family." > > So it can all be all a vm_bind engine that just does bind/unbind > operations. > > But yes we need another engine for the immediate/non-sparse operations. > > -Lionel > > > > Daniel, any thoughts? > > Niranjana > > >Matt > > > >> > >> Sorry I noticed this late. > >> > >> > >> -Lionel > >> > >>
On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
> On 02/06/2022 23:35, Jason Ekstrand wrote:
>> On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>> <niranjana.vishwanathapura@intel.com> wrote:
>>> On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>>>> On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>>>>> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>>>>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>>>>>> +async worker. The binding and unbinding will work like a special GPU engine.
>>>>>> +The binding and unbinding operations are serialized and will wait on specified
>>>>>> +input fences before the operation and will signal the output fences upon the
>>>>>> +completion of the operation. Due to serialization, completion of an operation
>>>>>> +will also indicate that all previous operations are also complete.
>>>>>
>>>>> I guess we should avoid saying "will immediately start binding/unbinding"
>>>>> if there are fences involved.
>>>>>
>>>>> And the fact that it's happening in an async worker seems to imply it's
>>>>> not immediate.
>>>
>>> Ok, will fix. This was added because in the earlier design binding was
>>> deferred until the next execbuff. But now it is non-deferred (immediate
>>> in that sense). But yah, this is confusing and will fix it.
>>>
>>>>> I have a question on the behavior of the bind operation when no input
>>>>> fence is provided. Let's say I do:
>>>>>
>>>>> VM_BIND (out_fence=fence1)
>>>>> VM_BIND (out_fence=fence2)
>>>>> VM_BIND (out_fence=fence3)
>>>>>
>>>>> In what order are the fences going to be signaled? In the order of
>>>>> VM_BIND ioctls? Or out of order? Because you wrote "serialized" I
>>>>> assume it's in order.
>>>
>>> Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind
>>> will use the same queue and hence are ordered.
>>>
>>>>> One thing I didn't realize is that because we only get one "VM_BIND"
>>>>> engine, there is a disconnect from the Vulkan specification.
>>>>>
>>>>> In Vulkan, VM_BIND operations are serialized but per engine. So you
>>>>> could have something like this:
>>>>>
>>>>> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>>>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>>>>
>>>>> fence1 is not signaled
>>>>> fence3 is signaled
>>>>>
>>>>> So the second VM_BIND will proceed before the first VM_BIND.
>>>>>
>>>>> I guess we can deal with that scenario in userspace by doing the wait
>>>>> ourselves in one thread per engine. But then it makes the VM_BIND
>>>>> input fences useless.
>>>>>
>>>>> Daniel: what do you think? Should we rework this or just deal with
>>>>> wait fences in userspace?
>>>>
>>>> My opinion is rework this but make the ordering via an engine param
>>>> optional.
>>>>
>>>> e.g. A VM can be configured so all binds are ordered within the VM.
>>>>
>>>> e.g. A VM can be configured so all binds accept an engine argument (in
>>>> the case of the i915 likely this is a gem context handle) and binds are
>>>> ordered with respect to that engine.
>>>>
>>>> This gives UMDs options, as the latter likely consumes more KMD
>>>> resources, so if a different UMD can live with binds being ordered
>>>> within the VM, they can use a mode consuming less resources.
>>>
>>> I think we need to be careful here if we are looking for some out of
>>> (submission) order completion of vm_bind/unbind.
>>>
>>> In-order completion means, in a batch of binds and unbinds to be
>>> completed in-order, user only needs to specify in-fence for the first
>>> bind/unbind call and the out-fence for the last bind/unbind call. Also,
>>> the VA released by an unbind call can be re-used by any subsequent bind
>>> call in that in-order batch.
>>>
>>> These things will break if binding/unbinding were to be allowed to go
>>> out of order (of submission) and the user needs to be extra careful not
>>> to run into premature triggering of the out-fence, bind failing as the
>>> VA is still in use, etc.
>>>
>>> Also, VM_BIND binds the provided mapping on the specified address space
>>> (VM). So, the uapi is not engine/context specific.
>>>
>>> We can however add a 'queue' to the uapi which can be one from the
>>> pre-defined queues,
>>> I915_VM_BIND_QUEUE_0
>>> I915_VM_BIND_QUEUE_1
>>> ...
>>> I915_VM_BIND_QUEUE_(N-1)
>>>
>>> KMD will spawn an async work queue for each queue which will only bind
>>> the mappings on that queue in the order of submission. User can assign
>>> the queue per engine or anything like that.
>>>
>>> But again here, the user needs to be careful and not deadlock these
>>> queues with a circular dependency of fences.
>>>
>>> I prefer adding this later as an extension based on whether it is
>>> really helping with the implementation.
>>
>> I can tell you right now that having everything on a single in-order
>> queue will not get us the perf we want. What vulkan really wants is one
>> of two things:
>> 1. No implicit ordering of VM_BIND ops. They just happen in whatever
>>    order their dependencies are resolved and we ensure ordering ourselves
>>    by having a syncobj in the VkQueue.
>> 2. The ability to create multiple VM_BIND queues. We need at least 2 but
>>    I don't see why there needs to be a limit besides the limits the i915
>>    API already has on the number of engines. Vulkan could expose multiple
>>    sparse binding queues to the client if it's not arbitrarily limited.
Thanks Jason, Lionel.
Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.
There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2. I think someone had an extended version that allowed more but I ripped it out because no one was using it. Of course, execbuffer3 might not have that problem at all.
Thanks Jason. Ok, I am not sure which exec flag that is, but yah, execbuffer3 probably will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE and somehow export it to the user (I am thinking of embedding it in I915_PARAM_HAS_VM_BIND: bits[0] -> HAS_VM_BIND, bits[1-3] -> 'n', meaning 2^n queues).
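For illustration, userspace discovery under that bit layout could look roughly like the sketch below. The encoding (bit 0 = feature present, bits 1-3 = log2 of the queue count) is only a proposal from this discussion, not settled uapi, and I915_PARAM_HAS_VM_BIND itself is still part of the RFC.

#include <stdbool.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static bool query_vm_bind(int fd, unsigned int *num_queues)
{
	int value = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_VM_BIND,   /* RFC param, not upstream yet */
		.value = &value,
	};

	if (ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp))
		return false;

	if (!(value & 0x1))
		return false;                      /* VM_BIND not supported */

	/* bits[1-3] = n, advertising 2^n VM_BIND queues (proposed encoding) */
	*num_queues = 1u << ((value >> 1) & 0x7);
	return true;
}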
Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f), which execbuf3 will also have. So, we can simply define in the vm_bind/unbind structures:
#define I915_VM_BIND_MAX_QUEUE 64
        __u32 queue;
I think that will keep things simple.
Hmmm? What does the execbuf2 limit have to do with how many engines the hardware can have? I suggest not doing that.
The change which added this to context creation:

	if (set.num_engines > I915_EXEC_RING_MASK + 1)
		return -EINVAL;

needs to be undone, so that users can create engine maps with all hardware engines and execbuf3 can access them all.
The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuff3 also. Hence, I was using the same limit for VM_BIND queues (64, or 65 if we make it N+1). But, as discussed in another thread of this RFC series, we are planning to drop I915_EXEC_RING_MASK in execbuff3. So, there won't be any uapi that limits the number of engines (and hence the number of vm_bind queues that need to be supported).

If we leave the number of vm_bind queues arbitrarily large (__u32 queue_idx), then we need a hashmap for queue (a wq, work_item and a linked list) lookup from the user-specified queue index. The other option is to just put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer this.
Niranjana
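A rough kernel-side sketch of the "hard limit plus array of queues created upon first use" option mentioned above. The structure and function names here are purely illustrative, not the actual i915 implementation.

#include <linux/err.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/workqueue.h>

#define I915_VM_BIND_MAX_QUEUE 64

struct example_bind_queue {
	struct workqueue_struct *wq;	/* ordered wq, one per bind queue */
};

struct example_vm {
	struct mutex queues_lock;
	struct example_bind_queue *queues[I915_VM_BIND_MAX_QUEUE];
};

static struct example_bind_queue *
example_vm_get_bind_queue(struct example_vm *vm, u32 queue_idx)
{
	struct example_bind_queue *q;

	/* Simple bounds check on the user-supplied queue index. */
	if (queue_idx >= I915_VM_BIND_MAX_QUEUE)
		return ERR_PTR(-EINVAL);

	mutex_lock(&vm->queues_lock);
	q = vm->queues[queue_idx];
	if (!q) {
		/* Lazily create the queue on first use. */
		q = kzalloc(sizeof(*q), GFP_KERNEL);
		if (q) {
			q->wq = alloc_ordered_workqueue("vm_bind_q%u", 0, queue_idx);
			if (!q->wq) {
				kfree(q);
				q = NULL;
			} else {
				vm->queues[queue_idx] = q;
			}
		}
	}
	mutex_unlock(&vm->queues_lock);

	return q ?: ERR_PTR(-ENOMEM);
}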
Regards,
Tvrtko
Niranjana
I am trying to see how many queues we need and don't want it to be arbitrarily large and unduly blow up memory usage and complexity in the i915 driver.
I expect a Vulkan driver to use at most 2 in the vast majority of cases. I could imagine a client wanting to create more than 1 sparse queue in which case, it'll be N+1 but that's unlikely. As far as complexity goes, once you allow two, I don't think the complexity is going up by allowing N. As for memory usage, creating more queues means more memory. That's a trade-off that userspace can make. Again, the expected number here is 1 or 2 in the vast majority of cases so I don't think you need to worry.
Ok, will start with n=3, meaning 8 queues. That would require us to create 8 workqueues. We can change 'n' later if required.
Niranjana
> Why? Because Vulkan has two basic kinds of bind operations and we don't
> want any dependencies between them:
>
> 1. Immediate. These happen right after BO creation or maybe as part of
>    vkBindImageMemory() or VkBindBufferMemory(). These don't happen on a
>    queue and we don't want them serialized with anything. To synchronize
>    with submit, we'll have a syncobj in the VkDevice which is signaled by
>    all immediate bind operations and make submits wait on it.
>
> 2. Queued (sparse): These happen on a VkQueue which may be the same as a
>    render/compute queue or may be its own queue. It's up to us what we
>    want to advertise. From the Vulkan API PoV, this is like any other
>    queue. Operations on it wait on and signal semaphores. If we have a
>    VM_BIND engine, we'd provide syncobjs to wait and signal just like we
>    do in execbuf().
>
> The important thing is that we don't want one type of operation to block
> on the other. If immediate binds are blocking on sparse binds, it's going
> to cause over-synchronization issues.
>
> In terms of the internal implementation, I know that there's going to be
> a lock on the VM and that we can't actually do these things in parallel.
> That's fine. Once the dma_fences have signaled and we're
That's correct. It is like a single VM_BIND engine with multiple queues feeding into it.
Right. As long as the queues themselves are independent and can block on dma_fences without holding up other queues, I think we're fine.
> unblocked to do the bind operation, I don't care if there's a bit of
> synchronization due to locking. That's expected. What we can't afford to
> have is an immediate bind operation suddenly blocking on a sparse
> operation which is blocked on a compute job that's going to run for
> another 5ms.
As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND on other VMs. I am not sure about use cases here, but just wanted to clarify.
Yes, that's what I would expect. --Jason
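To make the immediate-vs-sparse split above concrete, here is a minimal userspace-side sketch. It assumes the per-VM 'queue' index floated earlier in this thread and a hypothetical vm_bind_ioctl() wrapper around the (still RFC) VM_BIND ioctl; none of this is final uapi.

#include <stdint.h>

/* Hypothetical thin wrapper around the RFC vm_bind ioctl; the real struct
 * layout and the 'queue' field are not final. */
int vm_bind_ioctl(int fd, uint32_t vm_id, uint32_t queue, uint32_t handle,
		  uint64_t start, uint64_t length,
		  uint32_t in_syncobj, uint32_t out_syncobj);

enum {
	BIND_QUEUE_IMMEDIATE = 0,	/* vkBindImageMemory()-style binds */
	BIND_QUEUE_SPARSE    = 1,	/* vkQueueBindSparse()-style binds */
};

/* Immediate binds take no in-fence and only signal a device-wide syncobj
 * that later submissions wait on. */
static int bind_immediate(int fd, uint32_t vm_id, uint32_t handle,
			  uint64_t addr, uint64_t size,
			  uint32_t device_bind_syncobj)
{
	return vm_bind_ioctl(fd, vm_id, BIND_QUEUE_IMMEDIATE, handle, addr,
			     size, 0, device_bind_syncobj);
}

/* Sparse binds are fenced against the application's sparse queue, on a
 * separate bind queue so they never hold up the immediate queue above. */
static int bind_sparse(int fd, uint32_t vm_id, uint32_t handle,
		       uint64_t addr, uint64_t size,
		       uint32_t wait_syncobj, uint32_t signal_syncobj)
{
	return vm_bind_ioctl(fd, vm_id, BIND_QUEUE_SPARSE, handle, addr,
			     size, wait_syncobj, signal_syncobj);
}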
Niranjana
> For reference, Windows solves this by allowing arbitrarily many paging
> queues (what they call a VM_BIND engine/queue). That design works pretty
> well and solves the problems in question. Again, we could just make
> everything out-of-order and require using syncobjs to order things as
> userspace wants. That'd be fine too.
>
> One more note while I'm here: danvet said something on IRC about VM_BIND
> queues waiting for syncobjs to materialize. We don't really want/need
> this. We already have all the machinery in userspace to handle
> wait-before-signal and waiting for syncobj fences to materialize, and
> that machinery is on by default. It would actually take MORE work in Mesa
> to turn it off and take advantage of the kernel being able to wait for
> syncobjs to materialize. Also, getting that right is ridiculously hard
> and I really don't want to get it wrong in kernel space. When we do
> memory fences, wait-before-signal will be a thing. We don't need to try
> and make it a thing for syncobj.
>
> --Jason
>
> Thanks Jason,
>
> I missed the bit in the Vulkan spec that we're allowed to have a sparse
> queue that does not implement either graphics or compute operations:
>
> "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
> support in queue families that also include graphics and compute support,
> other implementations may only expose a VK_QUEUE_SPARSE_BINDING_BIT-only
> queue family."
>
> So it can all be a vm_bind engine that just does bind/unbind operations.
>
> But yes, we need another engine for the immediate/non-sparse operations.
>
> -Lionel
>
> Daniel, any thoughts?
>
> Niranjana

>Matt
>
>> Sorry I noticed this late.
>>
>> -Lionel
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura < niranjana.vishwanathapura@intel.com> wrote:
> Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f), which
> execbuf3 will also have. So, we can simply define in the vm_bind/unbind
> structures:
>
> #define I915_VM_BIND_MAX_QUEUE 64
>         __u32 queue;
>
> I think that will keep things simple.

Yup! That's exactly the limit I was talking about.

> The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to
> execbuff3 also. Hence, I was using the same limit for VM_BIND queues
> (64, or 65 if we make it N+1). But, as discussed in another thread of
> this RFC series, we are planning to drop I915_EXEC_RING_MASK in
> execbuff3. So, there won't be any uapi that limits the number of engines
> (and hence the number of vm_bind queues that need to be supported).
>
> If we leave the number of vm_bind queues arbitrarily large (__u32
> queue_idx), then we need a hashmap for queue (a wq, work_item and a
> linked list) lookup from the user-specified queue index. The other
> option is to just put some hard limit (say 64 or 65) and use an array of
> queues in the VM (each created upon first use). I prefer this.
I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation.
--Jason
On Wed, Jun 08, 2022 at 04:55:38PM -0500, Jason Ekstrand wrote:
> I don't get why a VM_BIND queue is any different from any other queue or
> userspace-visible kernel object. But I'll leave those details up to
> danvet or whoever else might be reviewing the implementation.
In execbuf3, if the user-specified execbuf3.engine_id is beyond the number of available engines on the gem context, an error is returned to the user. In the VM_BIND case, I am not sure how to do a similar bounds check on the user-specified queue_idx.

In any case, it is an implementation detail and we can use a hashmap for the VM_BIND queues here (there might be a slight ioctl latency added due to the hash lookup, but in the normal case it should be insignificant), which should be OK.
Niranjana
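For completeness, a sketch of that unbounded-index alternative, using an xarray (rather than a literal hashmap) keyed by the user-specified queue index. As before, the names are illustrative only and this is not the actual implementation.

#include <linux/err.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/workqueue.h>
#include <linux/xarray.h>

struct example_bind_queue {
	struct workqueue_struct *wq;	/* ordered wq, one per bind queue */
};

static struct example_bind_queue *
example_lookup_bind_queue(struct xarray *queues, u32 queue_idx)
{
	struct example_bind_queue *q, *old;

	q = xa_load(queues, queue_idx);
	if (q)
		return q;

	/* Lazily create the queue on first use of this index. */
	q = kzalloc(sizeof(*q), GFP_KERNEL);
	if (!q)
		return ERR_PTR(-ENOMEM);
	q->wq = alloc_ordered_workqueue("vm_bind_q%u", 0, queue_idx);
	if (!q->wq) {
		kfree(q);
		return ERR_PTR(-ENOMEM);
	}

	/* xa_cmpxchg() resolves a race with a concurrent creator. */
	old = xa_cmpxchg(queues, queue_idx, NULL, q, GFP_KERNEL);
	if (old) {
		destroy_workqueue(q->wq);
		kfree(q);
		q = xa_is_err(old) ? ERR_PTR(xa_err(old)) : old;
	}
	return q;
}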
On 09/06/2022 00:55, Jason Ekstrand wrote:
> I don't get why a VM_BIND queue is any different from any other queue or
> userspace-visible kernel object. But I'll leave those details up to
> danvet or whoever else might be reviewing the implementation.
>
> --Jason
I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map?

For userspace it's then just a matter of selecting the right queue ID when submitting.
If there is ever a possibility to have this work on the GPU, it would be all ready.
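If bind queues were exposed through the existing engine-map interface as suggested here, userspace could conceivably list them alongside the GPU engines when configuring the context. A very rough sketch, assuming a hypothetical I915_ENGINE_CLASS_VM_BIND-style class value (no such class exists in i915_drm.h today) and a simplified stand-in for the map normally built with I915_DEFINE_CONTEXT_PARAM_ENGINES and set via I915_CONTEXT_PARAM_ENGINES:

#include <drm/i915_drm.h>

/* Hypothetical: there is no VM_BIND engine class in the uapi today. */
#define EXAMPLE_ENGINE_CLASS_VM_BIND 5

/* Simplified stand-in for the engine map uapi struct. */
struct example_engine_map {
	__u64 extensions;
	struct i915_engine_class_instance engines[3];
};

static const struct example_engine_map example_map = {
	.engines = {
		{ .engine_class = I915_ENGINE_CLASS_RENDER,     .engine_instance = 0 }, /* idx 0: execbuf */
		{ .engine_class = I915_ENGINE_CLASS_COMPUTE,    .engine_instance = 0 }, /* idx 1: execbuf */
		{ .engine_class = EXAMPLE_ENGINE_CLASS_VM_BIND, .engine_instance = 0 }, /* idx 2: vm_bind/unbind */
	},
};

Submission would then pick index 0 or 1, while vm_bind/unbind would reference index 2; note this is exactly the coupling to gem_context that the reply below pushes back on.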
Thanks,
-Lionel
Niranjana >Regards, > >Tvrtko > >> >>Niranjana >> >>> >>>> I am trying to see how many queues we need and don't want it to be >>>> arbitrarily >>>> large and unduely blow up memory usage and complexity in i915 driver. >>>> >>>> I expect a Vulkan driver to use at most 2 in the vast majority >>>>of cases. I >>>> could imagine a client wanting to create more than 1 sparse >>>>queue in which >>>> case, it'll be N+1 but that's unlikely. As far as complexity >>>>goes, once >>>> you allow two, I don't think the complexity is going up by >>>>allowing N. As >>>> for memory usage, creating more queues means more memory. That's a >>>> trade-off that userspace can make. Again, the expected number >>>>here is 1 >>>> or 2 in the vast majority of cases so I don't think you need to worry. >>> >>>Ok, will start with n=3 meaning 8 queues. >>>That would require us create 8 workqueues. >>>We can change 'n' later if required. >>> >>>Niranjana >>> >>>> >>>> > Why? Because Vulkan has two basic kind of bind >>>>operations and we >>>> don't >>>> > want any dependencies between them: >>>> > 1. Immediate. These happen right after BO creation or >>>>maybe as >>>> part of >>>> > vkBindImageMemory() or VkBindBufferMemory(). These >>>>don't happen >>>> on a >>>> > queue and we don't want them serialized with anything. To >>>> synchronize >>>> > with submit, we'll have a syncobj in the VkDevice which is >>>> signaled by >>>> > all immediate bind operations and make submits wait on it. >>>> > 2. Queued (sparse): These happen on a VkQueue which may be the >>>> same as >>>> > a render/compute queue or may be its own queue. It's up to us >>>> what we >>>> > want to advertise. From the Vulkan API PoV, this is like any >>>> other >>>> > queue. Operations on it wait on and signal semaphores. If we >>>> have a >>>> > VM_BIND engine, we'd provide syncobjs to wait and >>>>signal just like >>>> we do >>>> > in execbuf(). >>>> > The important thing is that we don't want one type of >>>>operation to >>>> block >>>> > on the other. If immediate binds are blocking on sparse binds, >>>> it's >>>> > going to cause over-synchronization issues. >>>> > In terms of the internal implementation, I know that >>>>there's going >>>> to be >>>> > a lock on the VM and that we can't actually do these things in >>>> > parallel. That's fine. Once the dma_fences have signaled and >>>> we're >>>> >>>> Thats correct. It is like a single VM_BIND engine with >>>>multiple queues >>>> feeding to it. >>>> >>>> Right. As long as the queues themselves are independent and >>>>can block on >>>> dma_fences without holding up other queues, I think we're fine. >>>> >>>> > unblocked to do the bind operation, I don't care if >>>>there's a bit >>>> of >>>> > synchronization due to locking. That's expected. What >>>>we can't >>>> afford >>>> > to have is an immediate bind operation suddenly blocking on a >>>> sparse >>>> > operation which is blocked on a compute job that's going to run >>>> for >>>> > another 5ms. >>>> >>>> As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the >>>> VM_BIND >>>> on other VMs. I am not sure about usecases here, but just wanted to >>>> clarify. >>>> >>>> Yes, that's what I would expect. >>>> --Jason >>>> >>>> Niranjana >>>> >>>> > For reference, Windows solves this by allowing arbitrarily many >>>> paging >>>> > queues (what they call a VM_BIND engine/queue). That >>>>design works >>>> > pretty well and solves the problems in question. 
>>>>Again, we could >>>> just >>>> > make everything out-of-order and require using syncobjs >>>>to order >>>> things >>>> > as userspace wants. That'd be fine too. >>>> > One more note while I'm here: danvet said something on >>>>IRC about >>>> VM_BIND >>>> > queues waiting for syncobjs to materialize. We don't really >>>> want/need >>>> > this. We already have all the machinery in userspace to handle >>>> > wait-before-signal and waiting for syncobj fences to >>>>materialize >>>> and >>>> > that machinery is on by default. It would actually >>>>take MORE work >>>> in >>>> > Mesa to turn it off and take advantage of the kernel >>>>being able to >>>> wait >>>> > for syncobjs to materialize. Also, getting that right is >>>> ridiculously >>>> > hard and I really don't want to get it wrong in kernel >>>>space. When we >>>> > do memory fences, wait-before-signal will be a thing. We don't >>>> need to >>>> > try and make it a thing for syncobj. >>>> > --Jason >>>> > >>>> > Thanks Jason, >>>> > >>>> > I missed the bit in the Vulkan spec that we're allowed to have a >>>> sparse >>>> > queue that does not implement either graphics or compute >>>>operations >>>> : >>>> > >>>> > "While some implementations may include >>>> VK_QUEUE_SPARSE_BINDING_BIT >>>> > support in queue families that also include >>>> > >>>> > graphics and compute support, other implementations may only >>>> expose a >>>> > VK_QUEUE_SPARSE_BINDING_BIT-only queue >>>> > >>>> > family." >>>> > >>>> > So it can all be all a vm_bind engine that just does bind/unbind >>>> > operations. >>>> > >>>> > But yes we need another engine for the immediate/non-sparse >>>> operations. >>>> > >>>> > -Lionel >>>> > >>>> > > >>>> > Daniel, any thoughts? >>>> > >>>> > Niranjana >>>> > >>>> > >Matt >>>> > > >>>> > >> >>>> > >> Sorry I noticed this late. >>>> > >> >>>> > >> >>>> > >> -Lionel >>>> > >> >>>> > >>
On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
On 09/06/2022 00:55, Jason Ekstrand wrote:
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> wrote: On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote: > > >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote: >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote: >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote: >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura >>>> <niranjana.vishwanathapura@intel.com> wrote: >>>> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote: >>>> > >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura >>>> > <niranjana.vishwanathapura@intel.com> wrote: >>>> > >>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew >>>>Brost wrote: >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin >>>> wrote: >>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start >>>> binding/unbinding >>>> > the mapping in an >>>> > >> > +async worker. The binding and unbinding will >>>>work like a >>>> special >>>> > GPU engine. >>>> > >> > +The binding and unbinding operations are serialized and >>>> will >>>> > wait on specified >>>> > >> > +input fences before the operation and will signal the >>>> output >>>> > fences upon the >>>> > >> > +completion of the operation. Due to serialization, >>>> completion of >>>> > an operation >>>> > >> > +will also indicate that all previous operations >>>>are also >>>> > complete. >>>> > >> >>>> > >> I guess we should avoid saying "will immediately start >>>> > binding/unbinding" if >>>> > >> there are fences involved. >>>> > >> >>>> > >> And the fact that it's happening in an async >>>>worker seem to >>>> imply >>>> > it's not >>>> > >> immediate. >>>> > >> >>>> > >>>> > Ok, will fix. >>>> > This was added because in earlier design binding was deferred >>>> until >>>> > next execbuff. >>>> > But now it is non-deferred (immediate in that sense). >>>>But yah, >>>> this is >>>> > confusing >>>> > and will fix it. >>>> > >>>> > >> >>>> > >> I have a question on the behavior of the bind >>>>operation when >>>> no >>>> > input fence >>>> > >> is provided. Let say I do : >>>> > >> >>>> > >> VM_BIND (out_fence=fence1) >>>> > >> >>>> > >> VM_BIND (out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (out_fence=fence3) >>>> > >> >>>> > >> >>>> > >> In what order are the fences going to be signaled? >>>> > >> >>>> > >> In the order of VM_BIND ioctls? Or out of order? >>>> > >> >>>> > >> Because you wrote "serialized I assume it's : in order >>>> > >> >>>> > >>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that >>>>bind and >>>> unbind >>>> > will use >>>> > the same queue and hence are ordered. >>>> > >>>> > >> >>>> > >> One thing I didn't realize is that because we only get one >>>> > "VM_BIND" engine, >>>> > >> there is a disconnect from the Vulkan specification. >>>> > >> >>>> > >> In Vulkan VM_BIND operations are serialized but >>>>per engine. >>>> > >> >>>> > >> So you could have something like this : >>>> > >> >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) >>>> > >> >>>> > >> >>>> > >> fence1 is not signaled >>>> > >> >>>> > >> fence3 is signaled >>>> > >> >>>> > >> So the second VM_BIND will proceed before the >>>>first VM_BIND. 
>>>> > >> >>>> > >> >>>> > >> I guess we can deal with that scenario in >>>>userspace by doing >>>> the >>>> > wait >>>> > >> ourselves in one thread per engines. >>>> > >> >>>> > >> But then it makes the VM_BIND input fences useless. >>>> > >> >>>> > >> >>>> > >> Daniel : what do you think? Should be rework this or just >>>> deal with >>>> > wait >>>> > >> fences in userspace? >>>> > >> >>>> > > >>>> > >My opinion is rework this but make the ordering via >>>>an engine >>>> param >>>> > optional. >>>> > > >>>> > >e.g. A VM can be configured so all binds are ordered >>>>within the >>>> VM >>>> > > >>>> > >e.g. A VM can be configured so all binds accept an engine >>>> argument >>>> > (in >>>> > >the case of the i915 likely this is a gem context >>>>handle) and >>>> binds >>>> > >ordered with respect to that engine. >>>> > > >>>> > >This gives UMDs options as the later likely consumes >>>>more KMD >>>> > resources >>>> > >so if a different UMD can live with binds being >>>>ordered within >>>> the VM >>>> > >they can use a mode consuming less resources. >>>> > > >>>> > >>>> > I think we need to be careful here if we are looking for some >>>> out of >>>> > (submission) order completion of vm_bind/unbind. >>>> > In-order completion means, in a batch of binds and >>>>unbinds to be >>>> > completed in-order, user only needs to specify >>>>in-fence for the >>>> > first bind/unbind call and the our-fence for the last >>>> bind/unbind >>>> > call. Also, the VA released by an unbind call can be >>>>re-used by >>>> > any subsequent bind call in that in-order batch. >>>> > >>>> > These things will break if binding/unbinding were to >>>>be allowed >>>> to >>>> > go out of order (of submission) and user need to be extra >>>> careful >>>> > not to run into pre-mature triggereing of out-fence and bind >>>> failing >>>> > as VA is still in use etc. >>>> > >>>> > Also, VM_BIND binds the provided mapping on the specified >>>> address >>>> > space >>>> > (VM). So, the uapi is not engine/context specific. >>>> > >>>> > We can however add a 'queue' to the uapi which can be >>>>one from >>>> the >>>> > pre-defined queues, >>>> > I915_VM_BIND_QUEUE_0 >>>> > I915_VM_BIND_QUEUE_1 >>>> > ... >>>> > I915_VM_BIND_QUEUE_(N-1) >>>> > >>>> > KMD will spawn an async work queue for each queue which will >>>> only >>>> > bind the mappings on that queue in the order of submission. >>>> > User can assign the queue to per engine or anything >>>>like that. >>>> > >>>> > But again here, user need to be careful and not >>>>deadlock these >>>> > queues with circular dependency of fences. >>>> > >>>> > I prefer adding this later an as extension based on >>>>whether it >>>> > is really helping with the implementation. >>>> > >>>> > I can tell you right now that having everything on a single >>>> in-order >>>> > queue will not get us the perf we want. What vulkan >>>>really wants >>>> is one >>>> > of two things: >>>> > 1. No implicit ordering of VM_BIND ops. They just happen in >>>> whatever >>>> > their dependencies are resolved and we ensure ordering >>>>ourselves >>>> by >>>> > having a syncobj in the VkQueue. >>>> > 2. The ability to create multiple VM_BIND queues. We need at >>>> least 2 >>>> > but I don't see why there needs to be a limit besides >>>>the limits >>>> the >>>> > i915 API already has on the number of engines. Vulkan could >>>> expose >>>> > multiple sparse binding queues to the client if it's not >>>> arbitrarily >>>> > limited. >>>> >>>> Thanks Jason, Lionel. 
>>>> >>>> Jason, what are you referring to when you say "limits the i915 API >>>> already >>>> has on the number of engines"? I am not sure if there is such an uapi >>>> today. >>>> >>>> There's a limit of something like 64 total engines today based on the >>>> number of bits we can cram into the exec flags in execbuffer2. I think >>>> someone had an extended version that allowed more but I ripped it out >>>> because no one was using it. Of course, execbuffer3 might not >>>>have that >>>> problem at all. >>>> >>> >>>Thanks Jason. >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably >>>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE >>>and somehow export it to user (I am thinking of embedding it in >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n >>>queues. >> >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3 Yup! That's exactly the limit I was talking about. >>will also have. So, we can simply define in vm_bind/unbind structures, >> >>#define I915_VM_BIND_MAX_QUEUE 64 >> __u32 queue; >> >>I think that will keep things simple. > >Hmmm? What does execbuf2 limit has to do with how many engines >hardware can have? I suggest not to do that. > >Change with added this: > > if (set.num_engines > I915_EXEC_RING_MASK + 1) > return -EINVAL; > >To context creation needs to be undone and so let users create engine >maps with all hardware engines, and let execbuf3 access them all. > Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to execbuff3 also. Hence, I was using the same limit for VM_BIND queues (64, or 65 if we make it N+1). But, as discussed in other thread of this RFC series, we are planning to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be any uapi that limits the number of engines (and hence the vm_bind queues need to be supported). If we leave the number of vm_bind queues to be arbitrarily large (__u32 queue_idx) then, we need to have a hashmap for queue (a wq, work_item and a linked list) lookup from the user specified queue index. Other option is to just put some hard limit (say 64 or 65) and use an array of queues in VM (each created upon first use). I prefer this. I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation. --Jason
I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map?
For userspace it's then just a matter of selecting the right queue ID when submitting.
If there is ever a possibility to have this work on the GPU, it would all be ready.
I did sync offline with Matt Brost on this. We can add a VM_BIND engine class and let the user create VM_BIND engines (queues). The problem is, in i915, the engine creation interface is bound to gem_context. So, in the vm_bind ioctl, we would need both a context_id and a queue_idx for proper lookup of the user-created engine. This is a bit awkward, as vm_bind is an interface to the VM (address space) and has nothing to do with gem_context. Another problem is, if two VMs are binding with the same defined engine, binding on VM1 can get unnecessarily blocked by binding on VM2 (which may be waiting on its in_fence).
So, my preference here is to just add a 'u32 queue' index in the vm_bind/unbind ioctl, with the queues being per VM.
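(For illustration, a minimal sketch of what such a per-VM 'queue' field could look like in the bind payload. All field names and sizes here are assumptions made for this sketch, not the structure from the RFC header.)

    #include <linux/types.h>

    /* Hypothetical layout, for illustration only -- not the RFC's
     * struct drm_i915_gem_vm_bind definition. */
    struct hypothetical_vm_bind {
            __u32 vm_id;        /* address space (VM) to bind into */
            __u32 queue_idx;    /* per-VM bind queue; ops on one queue complete in submission order */
            __u32 handle;       /* GEM object handle */
            __u32 pad;
            __u64 start;        /* GPU virtual address of the mapping */
            __u64 offset;       /* offset into the object (partial binds) */
            __u64 length;       /* length of the mapping */
            __u64 flags;
            __u64 extensions;   /* chained extensions, e.g. in/out fences */
    };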
Niranjana
Thanks,
-Lionel
Niranjana >Regards, > >Tvrtko > >> >>Niranjana >> >>> >>>> I am trying to see how many queues we need and don't want it to be >>>> arbitrarily >>>> large and unduely blow up memory usage and complexity in i915 driver. >>>> >>>> I expect a Vulkan driver to use at most 2 in the vast majority >>>>of cases. I >>>> could imagine a client wanting to create more than 1 sparse >>>>queue in which >>>> case, it'll be N+1 but that's unlikely. As far as complexity >>>>goes, once >>>> you allow two, I don't think the complexity is going up by >>>>allowing N. As >>>> for memory usage, creating more queues means more memory. That's a >>>> trade-off that userspace can make. Again, the expected number >>>>here is 1 >>>> or 2 in the vast majority of cases so I don't think you need to worry. >>> >>>Ok, will start with n=3 meaning 8 queues. >>>That would require us create 8 workqueues. >>>We can change 'n' later if required. >>> >>>Niranjana >>> >>>> >>>> > Why? Because Vulkan has two basic kind of bind >>>>operations and we >>>> don't >>>> > want any dependencies between them: >>>> > 1. Immediate. These happen right after BO creation or >>>>maybe as >>>> part of >>>> > vkBindImageMemory() or VkBindBufferMemory(). These >>>>don't happen >>>> on a >>>> > queue and we don't want them serialized with anything. To >>>> synchronize >>>> > with submit, we'll have a syncobj in the VkDevice which is >>>> signaled by >>>> > all immediate bind operations and make submits wait on it. >>>> > 2. Queued (sparse): These happen on a VkQueue which may be the >>>> same as >>>> > a render/compute queue or may be its own queue. It's up to us >>>> what we >>>> > want to advertise. From the Vulkan API PoV, this is like any >>>> other >>>> > queue. Operations on it wait on and signal semaphores. If we >>>> have a >>>> > VM_BIND engine, we'd provide syncobjs to wait and >>>>signal just like >>>> we do >>>> > in execbuf(). >>>> > The important thing is that we don't want one type of >>>>operation to >>>> block >>>> > on the other. If immediate binds are blocking on sparse binds, >>>> it's >>>> > going to cause over-synchronization issues. >>>> > In terms of the internal implementation, I know that >>>>there's going >>>> to be >>>> > a lock on the VM and that we can't actually do these things in >>>> > parallel. That's fine. Once the dma_fences have signaled and >>>> we're >>>> >>>> Thats correct. It is like a single VM_BIND engine with >>>>multiple queues >>>> feeding to it. >>>> >>>> Right. As long as the queues themselves are independent and >>>>can block on >>>> dma_fences without holding up other queues, I think we're fine. >>>> >>>> > unblocked to do the bind operation, I don't care if >>>>there's a bit >>>> of >>>> > synchronization due to locking. That's expected. What >>>>we can't >>>> afford >>>> > to have is an immediate bind operation suddenly blocking on a >>>> sparse >>>> > operation which is blocked on a compute job that's going to run >>>> for >>>> > another 5ms. >>>> >>>> As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the >>>> VM_BIND >>>> on other VMs. I am not sure about usecases here, but just wanted to >>>> clarify. >>>> >>>> Yes, that's what I would expect. >>>> --Jason >>>> >>>> Niranjana >>>> >>>> > For reference, Windows solves this by allowing arbitrarily many >>>> paging >>>> > queues (what they call a VM_BIND engine/queue). That >>>>design works >>>> > pretty well and solves the problems in question. 
>>>>Again, we could >>>> just >>>> > make everything out-of-order and require using syncobjs >>>>to order >>>> things >>>> > as userspace wants. That'd be fine too. >>>> > One more note while I'm here: danvet said something on >>>>IRC about >>>> VM_BIND >>>> > queues waiting for syncobjs to materialize. We don't really >>>> want/need >>>> > this. We already have all the machinery in userspace to handle >>>> > wait-before-signal and waiting for syncobj fences to >>>>materialize >>>> and >>>> > that machinery is on by default. It would actually >>>>take MORE work >>>> in >>>> > Mesa to turn it off and take advantage of the kernel >>>>being able to >>>> wait >>>> > for syncobjs to materialize. Also, getting that right is >>>> ridiculously >>>> > hard and I really don't want to get it wrong in kernel >>>>space. When we >>>> > do memory fences, wait-before-signal will be a thing. We don't >>>> need to >>>> > try and make it a thing for syncobj. >>>> > --Jason >>>> > >>>> > Thanks Jason, >>>> > >>>> > I missed the bit in the Vulkan spec that we're allowed to have a >>>> sparse >>>> > queue that does not implement either graphics or compute >>>>operations >>>> : >>>> > >>>> > "While some implementations may include >>>> VK_QUEUE_SPARSE_BINDING_BIT >>>> > support in queue families that also include >>>> > >>>> > graphics and compute support, other implementations may only >>>> expose a >>>> > VK_QUEUE_SPARSE_BINDING_BIT-only queue >>>> > >>>> > family." >>>> > >>>> > So it can all be all a vm_bind engine that just does bind/unbind >>>> > operations. >>>> > >>>> > But yes we need another engine for the immediate/non-sparse >>>> operations. >>>> > >>>> > -Lionel >>>> > >>>> > > >>>> > Daniel, any thoughts? >>>> > >>>> > Niranjana >>>> > >>>> > >Matt >>>> > > >>>> > >> >>>> > >> Sorry I noticed this late. >>>> > >> >>>> > >> >>>> > >> -Lionel >>>> > >> >>>> > >>
On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
On 09/06/2022 00:55, Jason Ekstrand wrote:
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote: > > >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote: >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote: >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote: >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura >>>> niranjana.vishwanathapura@intel.com wrote: >>>> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote: >>>> > >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura >>>> > niranjana.vishwanathapura@intel.com wrote: >>>> > >>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew >>>>Brost wrote: >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin >>>> wrote: >>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start >>>> binding/unbinding >>>> > the mapping in an >>>> > >> > +async worker. The binding and unbinding will >>>>work like a >>>> special >>>> > GPU engine. >>>> > >> > +The binding and unbinding operations are serialized and >>>> will >>>> > wait on specified >>>> > >> > +input fences before the operation and will signal the >>>> output >>>> > fences upon the >>>> > >> > +completion of the operation. Due to serialization, >>>> completion of >>>> > an operation >>>> > >> > +will also indicate that all previous operations >>>>are also >>>> > complete. >>>> > >> >>>> > >> I guess we should avoid saying "will immediately start >>>> > binding/unbinding" if >>>> > >> there are fences involved. >>>> > >> >>>> > >> And the fact that it's happening in an async >>>>worker seem to >>>> imply >>>> > it's not >>>> > >> immediate. >>>> > >> >>>> > >>>> > Ok, will fix. >>>> > This was added because in earlier design binding was deferred >>>> until >>>> > next execbuff. >>>> > But now it is non-deferred (immediate in that sense). >>>>But yah, >>>> this is >>>> > confusing >>>> > and will fix it. >>>> > >>>> > >> >>>> > >> I have a question on the behavior of the bind >>>>operation when >>>> no >>>> > input fence >>>> > >> is provided. Let say I do : >>>> > >> >>>> > >> VM_BIND (out_fence=fence1) >>>> > >> >>>> > >> VM_BIND (out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (out_fence=fence3) >>>> > >> >>>> > >> >>>> > >> In what order are the fences going to be signaled? >>>> > >> >>>> > >> In the order of VM_BIND ioctls? Or out of order? >>>> > >> >>>> > >> Because you wrote "serialized I assume it's : in order >>>> > >> >>>> > >>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that >>>>bind and >>>> unbind >>>> > will use >>>> > the same queue and hence are ordered. >>>> > >>>> > >> >>>> > >> One thing I didn't realize is that because we only get one >>>> > "VM_BIND" engine, >>>> > >> there is a disconnect from the Vulkan specification. >>>> > >> >>>> > >> In Vulkan VM_BIND operations are serialized but >>>>per engine. >>>> > >> >>>> > >> So you could have something like this : >>>> > >> >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) >>>> > >> >>>> > >> >>>> > >> fence1 is not signaled >>>> > >> >>>> > >> fence3 is signaled >>>> > >> >>>> > >> So the second VM_BIND will proceed before the >>>>first VM_BIND. >>>> > >> >>>> > >> >>>> > >> I guess we can deal with that scenario in >>>>userspace by doing >>>> the >>>> > wait >>>> > >> ourselves in one thread per engines. 
>>>> > >> >>>> > >> But then it makes the VM_BIND input fences useless. >>>> > >> >>>> > >> >>>> > >> Daniel : what do you think? Should be rework this or just >>>> deal with >>>> > wait >>>> > >> fences in userspace? >>>> > >> >>>> > > >>>> > >My opinion is rework this but make the ordering via >>>>an engine >>>> param >>>> > optional. >>>> > > >>>> > >e.g. A VM can be configured so all binds are ordered >>>>within the >>>> VM >>>> > > >>>> > >e.g. A VM can be configured so all binds accept an engine >>>> argument >>>> > (in >>>> > >the case of the i915 likely this is a gem context >>>>handle) and >>>> binds >>>> > >ordered with respect to that engine. >>>> > > >>>> > >This gives UMDs options as the later likely consumes >>>>more KMD >>>> > resources >>>> > >so if a different UMD can live with binds being >>>>ordered within >>>> the VM >>>> > >they can use a mode consuming less resources. >>>> > > >>>> > >>>> > I think we need to be careful here if we are looking for some >>>> out of >>>> > (submission) order completion of vm_bind/unbind. >>>> > In-order completion means, in a batch of binds and >>>>unbinds to be >>>> > completed in-order, user only needs to specify >>>>in-fence for the >>>> > first bind/unbind call and the our-fence for the last >>>> bind/unbind >>>> > call. Also, the VA released by an unbind call can be >>>>re-used by >>>> > any subsequent bind call in that in-order batch. >>>> > >>>> > These things will break if binding/unbinding were to >>>>be allowed >>>> to >>>> > go out of order (of submission) and user need to be extra >>>> careful >>>> > not to run into pre-mature triggereing of out-fence and bind >>>> failing >>>> > as VA is still in use etc. >>>> > >>>> > Also, VM_BIND binds the provided mapping on the specified >>>> address >>>> > space >>>> > (VM). So, the uapi is not engine/context specific. >>>> > >>>> > We can however add a 'queue' to the uapi which can be >>>>one from >>>> the >>>> > pre-defined queues, >>>> > I915_VM_BIND_QUEUE_0 >>>> > I915_VM_BIND_QUEUE_1 >>>> > ... >>>> > I915_VM_BIND_QUEUE_(N-1) >>>> > >>>> > KMD will spawn an async work queue for each queue which will >>>> only >>>> > bind the mappings on that queue in the order of submission. >>>> > User can assign the queue to per engine or anything >>>>like that. >>>> > >>>> > But again here, user need to be careful and not >>>>deadlock these >>>> > queues with circular dependency of fences. >>>> > >>>> > I prefer adding this later an as extension based on >>>>whether it >>>> > is really helping with the implementation. >>>> > >>>> > I can tell you right now that having everything on a single >>>> in-order >>>> > queue will not get us the perf we want. What vulkan >>>>really wants >>>> is one >>>> > of two things: >>>> > 1. No implicit ordering of VM_BIND ops. They just happen in >>>> whatever >>>> > their dependencies are resolved and we ensure ordering >>>>ourselves >>>> by >>>> > having a syncobj in the VkQueue. >>>> > 2. The ability to create multiple VM_BIND queues. We need at >>>> least 2 >>>> > but I don't see why there needs to be a limit besides >>>>the limits >>>> the >>>> > i915 API already has on the number of engines. Vulkan could >>>> expose >>>> > multiple sparse binding queues to the client if it's not >>>> arbitrarily >>>> > limited. >>>> >>>> Thanks Jason, Lionel. >>>> >>>> Jason, what are you referring to when you say "limits the i915 API >>>> already >>>> has on the number of engines"? I am not sure if there is such an uapi >>>> today. 
>>>> >>>> There's a limit of something like 64 total engines today based on the >>>> number of bits we can cram into the exec flags in execbuffer2. I think >>>> someone had an extended version that allowed more but I ripped it out >>>> because no one was using it. Of course, execbuffer3 might not >>>>have that >>>> problem at all. >>>> >>> >>>Thanks Jason. >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably >>>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE >>>and somehow export it to user (I am thinking of embedding it in >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n >>>queues. >> >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3
Yup! That's exactly the limit I was talking about.
>>will also have. So, we can simply define in vm_bind/unbind structures, >> >>#define I915_VM_BIND_MAX_QUEUE 64 >> __u32 queue; >> >>I think that will keep things simple. > >Hmmm? What does execbuf2 limit has to do with how many engines >hardware can have? I suggest not to do that. > >Change with added this: > > if (set.num_engines > I915_EXEC_RING_MASK + 1) > return -EINVAL; > >To context creation needs to be undone and so let users create engine >maps with all hardware engines, and let execbuf3 access them all. >
The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuff3 also. Hence, I was using the same limit for VM_BIND queues (64, or 65 if we make it N+1). But, as discussed in another thread of this RFC series, we are planning to drop I915_EXEC_RING_MASK in execbuff3. So, there won't be any uapi that limits the number of engines (and hence the number of vm_bind queues that need to be supported).
If we leave the number of vm_bind queues arbitrarily large (__u32 queue_idx), then we need a hashmap to look up the queue (a wq, work_item and a linked list) from the user-specified queue index. The other option is to just put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer this.
I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation. --Jason
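(For illustration of the 'hard limit plus an array of queues created upon first use' option above: a rough kernel-side sketch, assuming one ordered workqueue per user-visible queue index. The names, limit and locking are assumptions, not the actual i915 implementation.)

    #include <linux/err.h>
    #include <linux/mutex.h>
    #include <linux/types.h>
    #include <linux/workqueue.h>

    #define VM_BIND_MAX_QUEUE 64    /* illustrative hard limit, not final uapi */

    struct hypothetical_vm {
            /* one ordered bind queue per user-visible index, created lazily */
            struct workqueue_struct *bind_wq[VM_BIND_MAX_QUEUE];
            struct mutex lock;
    };

    static struct workqueue_struct *
    vm_get_bind_queue(struct hypothetical_vm *vm, u32 queue_idx)
    {
            struct workqueue_struct *wq;

            if (queue_idx >= VM_BIND_MAX_QUEUE)
                    return ERR_PTR(-EINVAL);

            mutex_lock(&vm->lock);
            wq = vm->bind_wq[queue_idx];
            if (!wq) {
                    /* created upon first use; 'ordered' means binds queued
                     * here complete in submission order */
                    wq = alloc_ordered_workqueue("vm_bind_q%u", 0, queue_idx);
                    if (!wq)
                            wq = ERR_PTR(-ENOMEM);
                    else
                            vm->bind_wq[queue_idx] = wq;
            }
            mutex_unlock(&vm->lock);

            return wq;
    }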
I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map?
For userspace it's then just a matter of selecting the right queue ID when submitting.
If there is ever a possibility to have this work on the GPU, it would all be ready.
I did sync offline with Matt Brost on this. We can add a VM_BIND engine class and let the user create VM_BIND engines (queues). The problem is, in i915, the engine creation interface is bound to gem_context. So, in the vm_bind ioctl, we would need both a context_id and a queue_idx for proper lookup of the user-created engine. This is a bit awkward, as vm_bind is an interface to the VM (address space) and has nothing to do with gem_context.
A gem_context has a single vm object, right?
Set through I915_CONTEXT_PARAM_VM at creation, or given a default one if not.
So it's just like picking up the vm like it's done at execbuffer time right now: eb->context->vm
Another problem is, if two VMs are binding with the same defined engine, binding on VM1 can get unnecessarily blocked by binding on VM2 (which may be waiting on its in_fence).
Maybe I'm missing something, but how can you have 2 vm objects with a single gem_context right now?
So, my preference here is to just add a 'u32 queue' index in the vm_bind/unbind ioctl, with the queues being per VM.
Niranjana
Thanks,
-Lionel
Niranjana
On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
On 09/06/2022 00:55, Jason Ekstrand wrote:
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote: > > >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote: >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote: >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote: >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura >>>> niranjana.vishwanathapura@intel.com wrote: >>>> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote: >>>> > >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura >>>> > niranjana.vishwanathapura@intel.com wrote: >>>> > >>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew >>>>Brost wrote: >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin >>>> wrote: >>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start >>>> binding/unbinding >>>> > the mapping in an >>>> > >> > +async worker. The binding and unbinding will >>>>work like a >>>> special >>>> > GPU engine. >>>> > >> > +The binding and unbinding operations are serialized and >>>> will >>>> > wait on specified >>>> > >> > +input fences before the operation and will signal the >>>> output >>>> > fences upon the >>>> > >> > +completion of the operation. Due to serialization, >>>> completion of >>>> > an operation >>>> > >> > +will also indicate that all previous operations >>>>are also >>>> > complete. >>>> > >> >>>> > >> I guess we should avoid saying "will immediately start >>>> > binding/unbinding" if >>>> > >> there are fences involved. >>>> > >> >>>> > >> And the fact that it's happening in an async >>>>worker seem to >>>> imply >>>> > it's not >>>> > >> immediate. >>>> > >> >>>> > >>>> > Ok, will fix. >>>> > This was added because in earlier design binding was deferred >>>> until >>>> > next execbuff. >>>> > But now it is non-deferred (immediate in that sense). >>>>But yah, >>>> this is >>>> > confusing >>>> > and will fix it. >>>> > >>>> > >> >>>> > >> I have a question on the behavior of the bind >>>>operation when >>>> no >>>> > input fence >>>> > >> is provided. Let say I do : >>>> > >> >>>> > >> VM_BIND (out_fence=fence1) >>>> > >> >>>> > >> VM_BIND (out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (out_fence=fence3) >>>> > >> >>>> > >> >>>> > >> In what order are the fences going to be signaled? >>>> > >> >>>> > >> In the order of VM_BIND ioctls? Or out of order? >>>> > >> >>>> > >> Because you wrote "serialized I assume it's : in order >>>> > >> >>>> > >>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that >>>>bind and >>>> unbind >>>> > will use >>>> > the same queue and hence are ordered. >>>> > >>>> > >> >>>> > >> One thing I didn't realize is that because we only get one >>>> > "VM_BIND" engine, >>>> > >> there is a disconnect from the Vulkan specification. >>>> > >> >>>> > >> In Vulkan VM_BIND operations are serialized but >>>>per engine. >>>> > >> >>>> > >> So you could have something like this : >>>> > >> >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) >>>> > >> >>>> > >> >>>> > >> fence1 is not signaled >>>> > >> >>>> > >> fence3 is signaled >>>> > >> >>>> > >> So the second VM_BIND will proceed before the >>>>first VM_BIND. >>>> > >> >>>> > >> >>>> > >> I guess we can deal with that scenario in >>>>userspace by doing >>>> the >>>> > wait >>>> > >> ourselves in one thread per engines. 
>>>> > >> >>>> > >> But then it makes the VM_BIND input fences useless. >>>> > >> >>>> > >> >>>> > >> Daniel : what do you think? Should be rework this or just >>>> deal with >>>> > wait >>>> > >> fences in userspace? >>>> > >> >>>> > > >>>> > >My opinion is rework this but make the ordering via >>>>an engine >>>> param >>>> > optional. >>>> > > >>>> > >e.g. A VM can be configured so all binds are ordered >>>>within the >>>> VM >>>> > > >>>> > >e.g. A VM can be configured so all binds accept an engine >>>> argument >>>> > (in >>>> > >the case of the i915 likely this is a gem context >>>>handle) and >>>> binds >>>> > >ordered with respect to that engine. >>>> > > >>>> > >This gives UMDs options as the later likely consumes >>>>more KMD >>>> > resources >>>> > >so if a different UMD can live with binds being >>>>ordered within >>>> the VM >>>> > >they can use a mode consuming less resources. >>>> > > >>>> > >>>> > I think we need to be careful here if we are looking for some >>>> out of >>>> > (submission) order completion of vm_bind/unbind. >>>> > In-order completion means, in a batch of binds and >>>>unbinds to be >>>> > completed in-order, user only needs to specify >>>>in-fence for the >>>> > first bind/unbind call and the our-fence for the last >>>> bind/unbind >>>> > call. Also, the VA released by an unbind call can be >>>>re-used by >>>> > any subsequent bind call in that in-order batch. >>>> > >>>> > These things will break if binding/unbinding were to >>>>be allowed >>>> to >>>> > go out of order (of submission) and user need to be extra >>>> careful >>>> > not to run into pre-mature triggereing of out-fence and bind >>>> failing >>>> > as VA is still in use etc. >>>> > >>>> > Also, VM_BIND binds the provided mapping on the specified >>>> address >>>> > space >>>> > (VM). So, the uapi is not engine/context specific. >>>> > >>>> > We can however add a 'queue' to the uapi which can be >>>>one from >>>> the >>>> > pre-defined queues, >>>> > I915_VM_BIND_QUEUE_0 >>>> > I915_VM_BIND_QUEUE_1 >>>> > ... >>>> > I915_VM_BIND_QUEUE_(N-1) >>>> > >>>> > KMD will spawn an async work queue for each queue which will >>>> only >>>> > bind the mappings on that queue in the order of submission. >>>> > User can assign the queue to per engine or anything >>>>like that. >>>> > >>>> > But again here, user need to be careful and not >>>>deadlock these >>>> > queues with circular dependency of fences. >>>> > >>>> > I prefer adding this later an as extension based on >>>>whether it >>>> > is really helping with the implementation. >>>> > >>>> > I can tell you right now that having everything on a single >>>> in-order >>>> > queue will not get us the perf we want. What vulkan >>>>really wants >>>> is one >>>> > of two things: >>>> > 1. No implicit ordering of VM_BIND ops. They just happen in >>>> whatever >>>> > their dependencies are resolved and we ensure ordering >>>>ourselves >>>> by >>>> > having a syncobj in the VkQueue. >>>> > 2. The ability to create multiple VM_BIND queues. We need at >>>> least 2 >>>> > but I don't see why there needs to be a limit besides >>>>the limits >>>> the >>>> > i915 API already has on the number of engines. Vulkan could >>>> expose >>>> > multiple sparse binding queues to the client if it's not >>>> arbitrarily >>>> > limited. >>>> >>>> Thanks Jason, Lionel. >>>> >>>> Jason, what are you referring to when you say "limits the i915 API >>>> already >>>> has on the number of engines"? I am not sure if there is such an uapi >>>> today. 
>>>> >>>> There's a limit of something like 64 total engines today based on the >>>> number of bits we can cram into the exec flags in execbuffer2. I think >>>> someone had an extended version that allowed more but I ripped it out >>>> because no one was using it. Of course, execbuffer3 might not >>>>have that >>>> problem at all. >>>> >>> >>>Thanks Jason. >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably >>>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE >>>and somehow export it to user (I am thinking of embedding it in >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n >>>queues. >> >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3
Yup! That's exactly the limit I was talking about.
>>will also have. So, we can simply define in vm_bind/unbind structures, >> >>#define I915_VM_BIND_MAX_QUEUE 64 >> __u32 queue; >> >>I think that will keep things simple. > >Hmmm? What does execbuf2 limit has to do with how many engines >hardware can have? I suggest not to do that. > >Change with added this: > > if (set.num_engines > I915_EXEC_RING_MASK + 1) > return -EINVAL; > >To context creation needs to be undone and so let users create engine >maps with all hardware engines, and let execbuf3 access them all. >
The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuff3 also. Hence, I was using the same limit for VM_BIND queues (64, or 65 if we make it N+1). But, as discussed in another thread of this RFC series, we are planning to drop I915_EXEC_RING_MASK in execbuff3. So, there won't be any uapi that limits the number of engines (and hence the number of vm_bind queues that need to be supported).
If we leave the number of vm_bind queues arbitrarily large (__u32 queue_idx), then we need a hashmap to look up the queue (a wq, work_item and a linked list) from the user-specified queue index. The other option is to just put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer this.
I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation. --Jason
I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map?
For userspace it's then just a matter of selecting the right queue ID when submitting.
If there is ever a possibility to have this work on the GPU, it would all be ready.
I did sync offline with Matt Brost on this. We can add a VM_BIND engine class and let the user create VM_BIND engines (queues). The problem is, in i915, the engine creation interface is bound to gem_context. So, in the vm_bind ioctl, we would need both a context_id and a queue_idx for proper lookup of the user-created engine. This is a bit awkward, as vm_bind is an interface to the VM (address space) and has nothing to do with gem_context.
A gem_context has a single vm object, right?
Set through I915_CONTEXT_PARAM_VM at creation, or given a default one if not.
So it's just like picking up the vm like it's done at execbuffer time right now: eb->context->vm
Are you suggesting replacing 'vm_id' with 'context_id' in the VM_BIND/UNBIND ioctl, and probably calling it CONTEXT_BIND/UNBIND, because the VM can be obtained from the context? I think the interface is clean as an interface to the VM. It is only that we don't have a clean way to create a raw VM_BIND engine (not associated with any context) with the i915 uapi. Maybe we can add such an interface, but I don't think it is worth it (we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as I mentioned above). Does anyone have any thoughts?
Another problem is, if two VMs are binding with the same defined engine, binding on VM1 can get unnecessarily blocked by binding on VM2 (which may be waiting on its in_fence).
Maybe I'm missing something, but how can you have 2 vm objects with a single gem_context right now?
No, we don't have 2 VMs for a gem_context. Say ctx1 has vm1 and ctx2 has vm2. The first vm_bind call was for vm1 with q_idx 1 in the ctx1 engine map. The second vm_bind call was for vm2 with q_idx 2 in the ctx2 engine map. If those two queue indices point to the same underlying vm_bind engine, then the second vm_bind call gets blocked until the first vm_bind call's 'in' fence is triggered and the bind completes.
With per-VM queues, this is not a problem as two VMs will not end up sharing the same queue.
BTW, I just posted an updated patch series: https://www.spinics.net/lists/dri-devel/msg350483.html
Niranjana
So, my preference here is to just add a 'u32 queue' index in the vm_bind/unbind ioctl, with the queues being per VM.
Niranjana
Thanks,
-Lionel
Niranjana
On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
On 09/06/2022 00:55, Jason Ekstrand wrote:
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote: > > >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote: >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote: >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote: >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura >>>> niranjana.vishwanathapura@intel.com wrote: >>>> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote: >>>> > >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura >>>> > niranjana.vishwanathapura@intel.com wrote: >>>> > >>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew >>>>Brost wrote: >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin >>>> wrote: >>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start >>>> binding/unbinding >>>> > the mapping in an >>>> > >> > +async worker. The binding and unbinding will >>>>work like a >>>> special >>>> > GPU engine. >>>> > >> > +The binding and unbinding operations are serialized and >>>> will >>>> > wait on specified >>>> > >> > +input fences before the operation and will signal the >>>> output >>>> > fences upon the >>>> > >> > +completion of the operation. Due to serialization, >>>> completion of >>>> > an operation >>>> > >> > +will also indicate that all previous operations >>>>are also >>>> > complete. >>>> > >> >>>> > >> I guess we should avoid saying "will immediately start >>>> > binding/unbinding" if >>>> > >> there are fences involved. >>>> > >> >>>> > >> And the fact that it's happening in an async >>>>worker seem to >>>> imply >>>> > it's not >>>> > >> immediate. >>>> > >> >>>> > >>>> > Ok, will fix. >>>> > This was added because in earlier design binding was deferred >>>> until >>>> > next execbuff. >>>> > But now it is non-deferred (immediate in that sense). >>>>But yah, >>>> this is >>>> > confusing >>>> > and will fix it. >>>> > >>>> > >> >>>> > >> I have a question on the behavior of the bind >>>>operation when >>>> no >>>> > input fence >>>> > >> is provided. Let say I do : >>>> > >> >>>> > >> VM_BIND (out_fence=fence1) >>>> > >> >>>> > >> VM_BIND (out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (out_fence=fence3) >>>> > >> >>>> > >> >>>> > >> In what order are the fences going to be signaled? >>>> > >> >>>> > >> In the order of VM_BIND ioctls? Or out of order? >>>> > >> >>>> > >> Because you wrote "serialized I assume it's : in order >>>> > >> >>>> > >>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that >>>>bind and >>>> unbind >>>> > will use >>>> > the same queue and hence are ordered. >>>> > >>>> > >> >>>> > >> One thing I didn't realize is that because we only get one >>>> > "VM_BIND" engine, >>>> > >> there is a disconnect from the Vulkan specification. >>>> > >> >>>> > >> In Vulkan VM_BIND operations are serialized but >>>>per engine. >>>> > >> >>>> > >> So you could have something like this : >>>> > >> >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) >>>> > >> >>>> > >> >>>> > >> fence1 is not signaled >>>> > >> >>>> > >> fence3 is signaled >>>> > >> >>>> > >> So the second VM_BIND will proceed before the >>>>first VM_BIND. >>>> > >> >>>> > >> >>>> > >> I guess we can deal with that scenario in >>>>userspace by doing >>>> the >>>> > wait >>>> > >> ourselves in one thread per engines. 
>>>> > >> >>>> > >> But then it makes the VM_BIND input fences useless. >>>> > >> >>>> > >> >>>> > >> Daniel : what do you think? Should be rework this or just >>>> deal with >>>> > wait >>>> > >> fences in userspace? >>>> > >> >>>> > > >>>> > >My opinion is rework this but make the ordering via >>>>an engine >>>> param >>>> > optional. >>>> > > >>>> > >e.g. A VM can be configured so all binds are ordered >>>>within the >>>> VM >>>> > > >>>> > >e.g. A VM can be configured so all binds accept an engine >>>> argument >>>> > (in >>>> > >the case of the i915 likely this is a gem context >>>>handle) and >>>> binds >>>> > >ordered with respect to that engine. >>>> > > >>>> > >This gives UMDs options as the later likely consumes >>>>more KMD >>>> > resources >>>> > >so if a different UMD can live with binds being >>>>ordered within >>>> the VM >>>> > >they can use a mode consuming less resources. >>>> > > >>>> > >>>> > I think we need to be careful here if we are looking for some >>>> out of >>>> > (submission) order completion of vm_bind/unbind. >>>> > In-order completion means, in a batch of binds and >>>>unbinds to be >>>> > completed in-order, user only needs to specify >>>>in-fence for the >>>> > first bind/unbind call and the our-fence for the last >>>> bind/unbind >>>> > call. Also, the VA released by an unbind call can be >>>>re-used by >>>> > any subsequent bind call in that in-order batch. >>>> > >>>> > These things will break if binding/unbinding were to >>>>be allowed >>>> to >>>> > go out of order (of submission) and user need to be extra >>>> careful >>>> > not to run into pre-mature triggereing of out-fence and bind >>>> failing >>>> > as VA is still in use etc. >>>> > >>>> > Also, VM_BIND binds the provided mapping on the specified >>>> address >>>> > space >>>> > (VM). So, the uapi is not engine/context specific. >>>> > >>>> > We can however add a 'queue' to the uapi which can be >>>>one from >>>> the >>>> > pre-defined queues, >>>> > I915_VM_BIND_QUEUE_0 >>>> > I915_VM_BIND_QUEUE_1 >>>> > ... >>>> > I915_VM_BIND_QUEUE_(N-1) >>>> > >>>> > KMD will spawn an async work queue for each queue which will >>>> only >>>> > bind the mappings on that queue in the order of submission. >>>> > User can assign the queue to per engine or anything >>>>like that. >>>> > >>>> > But again here, user need to be careful and not >>>>deadlock these >>>> > queues with circular dependency of fences. >>>> > >>>> > I prefer adding this later an as extension based on >>>>whether it >>>> > is really helping with the implementation. >>>> > >>>> > I can tell you right now that having everything on a single >>>> in-order >>>> > queue will not get us the perf we want. What vulkan >>>>really wants >>>> is one >>>> > of two things: >>>> > 1. No implicit ordering of VM_BIND ops. They just happen in >>>> whatever >>>> > their dependencies are resolved and we ensure ordering >>>>ourselves >>>> by >>>> > having a syncobj in the VkQueue. >>>> > 2. The ability to create multiple VM_BIND queues. We need at >>>> least 2 >>>> > but I don't see why there needs to be a limit besides >>>>the limits >>>> the >>>> > i915 API already has on the number of engines. Vulkan could >>>> expose >>>> > multiple sparse binding queues to the client if it's not >>>> arbitrarily >>>> > limited. >>>> >>>> Thanks Jason, Lionel. >>>> >>>> Jason, what are you referring to when you say "limits the i915 API >>>> already >>>> has on the number of engines"? I am not sure if there is such an uapi >>>> today. 
>>>> >>>> There's a limit of something like 64 total engines today based on the >>>> number of bits we can cram into the exec flags in execbuffer2. I think >>>> someone had an extended version that allowed more but I ripped it out >>>> because no one was using it. Of course, execbuffer3 might not >>>>have that >>>> problem at all. >>>> >>> >>>Thanks Jason. >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably >>>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE >>>and somehow export it to user (I am thinking of embedding it in >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n >>>queues. >> >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3
Yup! That's exactly the limit I was talking about.
>>will also have. So, we can simply define in vm_bind/unbind structures, >> >>#define I915_VM_BIND_MAX_QUEUE 64 >> __u32 queue; >> >>I think that will keep things simple. > >Hmmm? What does execbuf2 limit has to do with how many engines >hardware can have? I suggest not to do that. > >Change with added this: > > if (set.num_engines > I915_EXEC_RING_MASK + 1) > return -EINVAL; > >To context creation needs to be undone and so let users create engine >maps with all hardware engines, and let execbuf3 access them all. >
The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuff3 also. Hence, I was using the same limit for VM_BIND queues (64, or 65 if we make it N+1). But, as discussed in another thread of this RFC series, we are planning to drop I915_EXEC_RING_MASK in execbuff3. So, there won't be any uapi that limits the number of engines (and hence the number of vm_bind queues that need to be supported).
If we leave the number of vm_bind queues arbitrarily large (__u32 queue_idx), then we need a hashmap to look up the queue (a wq, work_item and a linked list) from the user-specified queue index. The other option is to just put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer this.
I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation. --Jason
I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map?
For userspace it's then just a matter of selecting the right queue ID when submitting.
If there is ever a possibility to have this work on the GPU, it would all be ready.
I did sync offline with Matt Brost on this. We can add a VM_BIND engine class and let the user create VM_BIND engines (queues). The problem is, in i915, the engine creation interface is bound to gem_context. So, in the vm_bind ioctl, we would need both a context_id and a queue_idx for proper lookup of the user-created engine. This is a bit awkward, as vm_bind is an interface to the VM (address space) and has nothing to do with gem_context.
A gem_context has a single vm object, right?
Set through I915_CONTEXT_PARAM_VM at creation, or given a default one if not.
So it's just like picking up the vm like it's done at execbuffer time right now: eb->context->vm
Are you suggesting replacing 'vm_id' with 'context_id' in the VM_BIND/UNBIND ioctl, and probably calling it CONTEXT_BIND/UNBIND, because the VM can be obtained from the context?
Yes, because if we go for engines, they're associated with a context and so also associated with the VM bound to the context.
I think the interface is clean as an interface to the VM. It is only that we don't have a clean way to create a raw VM_BIND engine (not associated with any context) with the i915 uapi. Maybe we can add such an interface, but I don't think it is worth it (we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as I mentioned above). Does anyone have any thoughts?
Another problem is, if two VMs are binding with the same defined engine, binding on VM1 can get unnecessarily blocked by binding on VM2 (which may be waiting on its in_fence).
Maybe I'm missing something, but how can you have 2 vm objects with a single gem_context right now?
No, we don't have 2 VMs for a gem_context. Say ctx1 has vm1 and ctx2 has vm2. The first vm_bind call was for vm1 with q_idx 1 in the ctx1 engine map. The second vm_bind call was for vm2 with q_idx 2 in the ctx2 engine map. If those two queue indices point to the same underlying vm_bind engine, then the second vm_bind call gets blocked until the first vm_bind call's 'in' fence is triggered and the bind completes.
With per-VM queues, this is not a problem as two VMs will not end up sharing the same queue.
BTW, I just posted an updated patch series: https://www.spinics.net/lists/dri-devel/msg350483.html
Niranjana
So, my preference here is to just add a 'u32 queue' index in the vm_bind/unbind ioctl, with the queues being per VM.
Niranjana
Thanks,
-Lionel
Niranjana
On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
On 09/06/2022 00:55, Jason Ekstrand wrote:
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote: > > >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote: >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote: >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote: >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura >>>> niranjana.vishwanathapura@intel.com wrote: >>>> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote: >>>> > >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura >>>> > niranjana.vishwanathapura@intel.com wrote: >>>> > >>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew >>>>Brost wrote: >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin >>>> wrote: >>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start >>>> binding/unbinding >>>> > the mapping in an >>>> > >> > +async worker. The binding and unbinding will >>>>work like a >>>> special >>>> > GPU engine. >>>> > >> > +The binding and unbinding operations are serialized and >>>> will >>>> > wait on specified >>>> > >> > +input fences before the operation and will signal the >>>> output >>>> > fences upon the >>>> > >> > +completion of the operation. Due to serialization, >>>> completion of >>>> > an operation >>>> > >> > +will also indicate that all previous operations >>>>are also >>>> > complete. >>>> > >> >>>> > >> I guess we should avoid saying "will immediately start >>>> > binding/unbinding" if >>>> > >> there are fences involved. >>>> > >> >>>> > >> And the fact that it's happening in an async >>>>worker seem to >>>> imply >>>> > it's not >>>> > >> immediate. >>>> > >> >>>> > >>>> > Ok, will fix. >>>> > This was added because in earlier design binding was deferred >>>> until >>>> > next execbuff. >>>> > But now it is non-deferred (immediate in that sense). >>>>But yah, >>>> this is >>>> > confusing >>>> > and will fix it. >>>> > >>>> > >> >>>> > >> I have a question on the behavior of the bind >>>>operation when >>>> no >>>> > input fence >>>> > >> is provided. Let say I do : >>>> > >> >>>> > >> VM_BIND (out_fence=fence1) >>>> > >> >>>> > >> VM_BIND (out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (out_fence=fence3) >>>> > >> >>>> > >> >>>> > >> In what order are the fences going to be signaled? >>>> > >> >>>> > >> In the order of VM_BIND ioctls? Or out of order? >>>> > >> >>>> > >> Because you wrote "serialized I assume it's : in order >>>> > >> >>>> > >>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that >>>>bind and >>>> unbind >>>> > will use >>>> > the same queue and hence are ordered. >>>> > >>>> > >> >>>> > >> One thing I didn't realize is that because we only get one >>>> > "VM_BIND" engine, >>>> > >> there is a disconnect from the Vulkan specification. >>>> > >> >>>> > >> In Vulkan VM_BIND operations are serialized but >>>>per engine. >>>> > >> >>>> > >> So you could have something like this : >>>> > >> >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) >>>> > >> >>>> > >> >>>> > >> fence1 is not signaled >>>> > >> >>>> > >> fence3 is signaled >>>> > >> >>>> > >> So the second VM_BIND will proceed before the >>>>first VM_BIND. >>>> > >> >>>> > >> >>>> > >> I guess we can deal with that scenario in >>>>userspace by doing >>>> the >>>> > wait >>>> > >> ourselves in one thread per engines. 
>>>> > >> >>>> > >> But then it makes the VM_BIND input fences useless. >>>> > >> >>>> > >> >>>> > >> Daniel : what do you think? Should be rework this or just >>>> deal with >>>> > wait >>>> > >> fences in userspace? >>>> > >> >>>> > > >>>> > >My opinion is rework this but make the ordering via >>>>an engine >>>> param >>>> > optional. >>>> > > >>>> > >e.g. A VM can be configured so all binds are ordered >>>>within the >>>> VM >>>> > > >>>> > >e.g. A VM can be configured so all binds accept an engine >>>> argument >>>> > (in >>>> > >the case of the i915 likely this is a gem context >>>>handle) and >>>> binds >>>> > >ordered with respect to that engine. >>>> > > >>>> > >This gives UMDs options as the later likely consumes >>>>more KMD >>>> > resources >>>> > >so if a different UMD can live with binds being >>>>ordered within >>>> the VM >>>> > >they can use a mode consuming less resources. >>>> > > >>>> > >>>> > I think we need to be careful here if we are looking for some >>>> out of >>>> > (submission) order completion of vm_bind/unbind. >>>> > In-order completion means, in a batch of binds and >>>>unbinds to be >>>> > completed in-order, user only needs to specify >>>>in-fence for the >>>> > first bind/unbind call and the our-fence for the last >>>> bind/unbind >>>> > call. Also, the VA released by an unbind call can be >>>>re-used by >>>> > any subsequent bind call in that in-order batch. >>>> > >>>> > These things will break if binding/unbinding were to >>>>be allowed >>>> to >>>> > go out of order (of submission) and user need to be extra >>>> careful >>>> > not to run into pre-mature triggereing of out-fence and bind >>>> failing >>>> > as VA is still in use etc. >>>> > >>>> > Also, VM_BIND binds the provided mapping on the specified >>>> address >>>> > space >>>> > (VM). So, the uapi is not engine/context specific. >>>> > >>>> > We can however add a 'queue' to the uapi which can be >>>>one from >>>> the >>>> > pre-defined queues, >>>> > I915_VM_BIND_QUEUE_0 >>>> > I915_VM_BIND_QUEUE_1 >>>> > ... >>>> > I915_VM_BIND_QUEUE_(N-1) >>>> > >>>> > KMD will spawn an async work queue for each queue which will >>>> only >>>> > bind the mappings on that queue in the order of submission. >>>> > User can assign the queue to per engine or anything >>>>like that. >>>> > >>>> > But again here, user need to be careful and not >>>>deadlock these >>>> > queues with circular dependency of fences. >>>> > >>>> > I prefer adding this later an as extension based on >>>>whether it >>>> > is really helping with the implementation. >>>> > >>>> > I can tell you right now that having everything on a single >>>> in-order >>>> > queue will not get us the perf we want. What vulkan >>>>really wants >>>> is one >>>> > of two things: >>>> > 1. No implicit ordering of VM_BIND ops. They just happen in >>>> whatever >>>> > their dependencies are resolved and we ensure ordering >>>>ourselves >>>> by >>>> > having a syncobj in the VkQueue. >>>> > 2. The ability to create multiple VM_BIND queues. We need at >>>> least 2 >>>> > but I don't see why there needs to be a limit besides >>>>the limits >>>> the >>>> > i915 API already has on the number of engines. Vulkan could >>>> expose >>>> > multiple sparse binding queues to the client if it's not >>>> arbitrarily >>>> > limited. >>>> >>>> Thanks Jason, Lionel. >>>> >>>> Jason, what are you referring to when you say "limits the i915 API >>>> already >>>> has on the number of engines"? I am not sure if there is such an uapi >>>> today. 
>>>> >>>> There's a limit of something like 64 total engines today based on the >>>> number of bits we can cram into the exec flags in execbuffer2. I think >>>> someone had an extended version that allowed more but I ripped it out >>>> because no one was using it. Of course, execbuffer3 might not >>>>have that >>>> problem at all. >>>> >>> >>>Thanks Jason. >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably >>>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE >>>and somehow export it to user (I am thinking of embedding it in >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n >>>queues. >> >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3
Yup! That's exactly the limit I was talking about.
>>will also have. So, we can simply define in vm_bind/unbind structures, >> >>#define I915_VM_BIND_MAX_QUEUE 64 >> __u32 queue; >> >>I think that will keep things simple. > >Hmmm? What does execbuf2 limit has to do with how many engines >hardware can have? I suggest not to do that. > >Change with added this: > > if (set.num_engines > I915_EXEC_RING_MASK + 1) > return -EINVAL; > >To context creation needs to be undone and so let users create engine >maps with all hardware engines, and let execbuf3 access them all. >
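To make the encoding proposed above concrete (bit 0 advertising VM_BIND, bits[1-3] carrying 'n' for 2^n queues), userspace could probe it roughly as below; this is only a sketch of that proposal, not an existing parameter layout:

    int value = 0;
    struct drm_i915_getparam gp = {
            .param = I915_PARAM_HAS_VM_BIND,   /* proposed param, not yet in i915_drm.h */
            .value = &value,
    };

    if (drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) == 0 && (value & 0x1)) {
            unsigned int n = (value >> 1) & 0x7;      /* bits[1-3] -> 'n' */
            unsigned int num_bind_queues = 1u << n;   /* 2^n bind queues */
            /* size userspace-side queue tracking from num_bind_queues */
    }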
The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuff3 as well. Hence, I was using the same limit for the VM_BIND queues (64, or 65 if we make it N+1). But, as discussed in another thread of this RFC series, we are planning to drop I915_EXEC_RING_MASK from execbuff3. So, there won't be any uapi that limits the number of engines (and hence the number of vm_bind queues that need to be supported).

If we leave the number of vm_bind queues arbitrarily large (__u32 queue_idx), then we need a hashmap to look up the queue (a wq, work_item and a linked list) from the user-specified queue index. The other option is to put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer the latter.
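A minimal sketch of that preferred option, assuming a hard cap and a per-VM array of queues created on first use (structure and helper names here are illustrative, not actual i915 code):

    #define I915_VM_BIND_MAX_QUEUE 64    /* hard cap discussed above */

    /* hypothetical per-VM bind queue; binds on one queue complete in order */
    struct vm_bind_queue {
            struct workqueue_struct *wq;
    };

    /* in the VM: all entries NULL at VM creation */
    struct vm_bind_queue *queues[I915_VM_BIND_MAX_QUEUE];

    static struct vm_bind_queue *
    get_bind_queue(struct i915_address_space *vm, u32 queue_idx)
    {
            if (queue_idx >= I915_VM_BIND_MAX_QUEUE)
                    return ERR_PTR(-EINVAL);

            /* created lazily on first use, under the VM lock */
            if (!vm->queues[queue_idx])
                    vm->queues[queue_idx] = bind_queue_create(vm);  /* hypothetical helper */

            return vm->queues[queue_idx];
    }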
I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation. --Jason
I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map?

For userspace it's then just a matter of selecting the right queue ID when submitting.

If there is ever a possibility of having this work done on the GPU, it would all be ready.
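To picture that suggestion: with a hypothetical I915_ENGINE_CLASS_VM_BIND class (not an existing uapi value), the bind queue would simply be another entry in the context's engine map, and userspace would pick its index at submission time:

    struct i915_engine_class_instance engines[] = {
            { .engine_class = I915_ENGINE_CLASS_RENDER,  .engine_instance = 0 },
            { .engine_class = I915_ENGINE_CLASS_COPY,    .engine_instance = 0 },
            /* hypothetical class, not an existing i915_drm.h value */
            { .engine_class = I915_ENGINE_CLASS_VM_BIND, .engine_instance = 0 },
    };

    /* passed via I915_CONTEXT_PARAM_ENGINES at context creation; a bind
     * submission would then name engine index 2, the same way execbuf
     * selects a queue by index today */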
I did sync offline with Matt Brost on this. We can add a VM_BIND engine class and let the user create VM_BIND engines (queues). The problem is, in i915 the engine creation interface is bound to the gem_context. So, in the vm_bind ioctl, we would need both a context_id and a queue_idx for proper lookup of the user-created engine. This is a bit awkward, as vm_bind is an interface to the VM (address space) and has nothing to do with the gem_context.
A gem_context has a single vm object right?
Set through I915_CONTEXT_PARAM_VM at creation or given a default one if not.
So it's just like picking up the vm like it's done at execbuffer time right now : eb->context->vm
Are you suggesting replacing 'vm_id' with 'context_id' in the VM_BIND/UNBIND ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be obtained from the context?
Yes, because if we go for engines, they're associated with a context and so also associated with the VM bound to the context.
Hmm...context doesn't sound like the right interface. It should be VM and engine (independent of context). The engine can be a virtual or soft engine (a kernel thread), each with its own queue. We can add an interface to create such engines (independent of context), but we are anyway implicitly creating one when the user uses a new queue_idx. If in the future we have hardware engines for the VM_BIND operation, we can have that explicit interface to create engine instances, and the queue_index in vm_bind/unbind will point to those engines. Does anyone have any thoughts? Daniel?
Niranjana
I think the interface is clean as an interface to the VM. It is only that we don't have a clean way to create a raw VM_BIND engine (not associated with any context) with the i915 uapi. Maybe we can add such an interface, but I don't think it is worth it (we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as I mentioned above). Does anyone have any thoughts?

Another problem is that if two VMs are binding with the same defined engine, a bind on VM1 can get unnecessarily blocked by a bind on VM2 (which may be waiting on its in_fence).
Maybe I'm missing something, but how can you have 2 vm objects with a single gem_context right now?
No, we don't have 2 VMs for a gem_context. Say we have ctx1 with vm1 and ctx2 with vm2. The first vm_bind call is for vm1 with q_idx 1 in ctx1's engine map, and the second vm_bind call is for vm2 with q_idx 2 in ctx2's engine map. If those two queue indices point to the same underlying vm_bind engine, then the second vm_bind call gets blocked until the first vm_bind call's 'in' fence is triggered and its bind completes.

With per-VM queues, this is not a problem, as two VMs will not end up sharing the same queue.
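In shorthand (these are not real ioctl wrappers, just an illustration of the per-VM scoping):

    /* with per-VM queues, q_idx 0 of vm1 and q_idx 0 of vm2 are distinct
     * queues, so vm2's bind cannot be stalled by vm1's in-fence */
    vm_bind(vm1, /* queue_idx */ 0, /* in_fence */ fence1, ...);  /* waits for fence1 */
    vm_bind(vm2, /* queue_idx */ 0, /* in_fence */ -1, ...);      /* proceeds immediately */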
BTW, I just posted an updated PATCH series: https://www.spinics.net/lists/dri-devel/msg350483.html
Niranjana
Regards, Oak
-----Original Message-----
From: Intel-gfx intel-gfx-bounces@lists.freedesktop.org On Behalf Of Niranjana Vishwanathapura
Sent: June 10, 2022 1:43 PM
To: Landwerlin, Lionel G lionel.g.landwerlin@intel.com
Cc: Intel GFX intel-gfx@lists.freedesktop.org; Maling list - DRI developers dri-devel@lists.freedesktop.org; Hellstrom, Thomas thomas.hellstrom@intel.com; Wilson, Chris P chris.p.wilson@intel.com; Vetter, Daniel daniel.vetter@intel.com; Christian König christian.koenig@amd.com
Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
Exposing gem_context or intel_context to user space is a strange concept to me. A context represents some HW resources that are used to complete a certain task. User space should only care about allocating some resources (memory, queues) and submitting tasks to queues. It doesn't care how a certain task is mapped to a HW context - the driver/GuC should take care of this.

So a cleaner interface to me is: user space creates a VM, creates a gem object and vm_binds it to the VM; allocates queues for this VM (internally these represent compute or blitter HW; a queue can be virtual to the user); and submits tasks to those queues. The user can create multiple queues under one VM, and one queue belongs to only one VM.
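In rough pseudo-uapi terms, the flow described above would be (shorthand calls for illustration, not actual i915 ioctls):

    vm = vm_create();                   /* address space */
    bo = gem_create(size);              /* backing object */
    vm_bind(vm, bo, va, size);          /* map the object into the VM */
    q  = queue_create(vm);              /* a queue belongs to exactly one VM */
    queue_submit(q, batch_va, in_fences, out_fences);
    /* i915/GuC later picks a HW engine and switches to this VM's page
     * tables when it runs work from the queue */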
The i915 driver/GuC manages the HW compute or blitter resources, which is transparent to user space. When i915 or the GuC decides to schedule a queue (run tasks on that queue), a HW engine will be picked up and set up properly for the VM of that queue (i.e., switched to the page tables of that VM) - this is a context switch.

From the vm_bind perspective, it simply binds a gem_object to a VM. The engine/queue is not a parameter to vm_bind, as any engine can be picked up by i915/GuC to execute a task using the VM-bound VA.
I didn't completely follow the discussion here. Just share some thoughts.
Regards, Oak
On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
Yah, I agree.
Lionel, how about we define the queue as a union { __u32 queue_idx; __u64 rsvd; }?
If required, we can extend by expanding the 'rsvd' field to <ctx_id, queue_idx> later with a flag.
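That is, something along these lines in the vm_bind/unbind structures (again just a sketch of the proposal):

    union {
            __u32 queue_idx;    /* per-VM bind queue index, for now */
            __u64 rsvd;         /* room to later grow into a <ctx_id, queue_idx>
                                 * pair, selected by a new flag */
    };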
Niranjana
On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
If we leave the number of vm_bind queues to be arbitrarily large (__u32 queue_idx), then we need a hashmap for queue lookup (a wq, work_item and a linked list) from the user-specified queue index. The other option is to just put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer this.

I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation.

--Jason

I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map?

For userspace it's then just a matter of selecting the right queue ID when submitting.

If there is ever a possibility to have this work on the GPU, it would be all ready.

I did sync offline with Matt Brost on this. We can add a VM_BIND engine class and let the user create VM_BIND engines (queues). The problem is that in i915 the engine creation interface is bound to gem_context. So, in the vm_bind ioctl, we would need both context_id and queue_idx for a proper lookup of the user-created engine. This is a bit awkward as vm_bind is an interface to the VM (address space) and has nothing to do with gem_context.
A gem_context has a single vm object right?
Set through I915_CONTEXT_PARAM_VM at creation or given a default one if not.
So it's just like picking up the vm the way it's done at execbuffer time right now: eb->context->vm
Are you suggesting replacing 'vm_id' with 'context_id' in the VM_BIND/UNBIND ioctl, and probably calling it CONTEXT_BIND/UNBIND, because the VM can be obtained from the context?
Yes, because if we go for engines, they're associated with a context and so also associated with the VM bound to the context.
Hmm...context doesn't sound like the right interface. It should be VM and engine (independent of context). The engine can be a virtual or soft engine (kernel thread), each with its own queue. We can add an interface to create such engines (independent of context). But we are anyway implicitly creating them when the user uses a new queue_idx. If in the future we have hardware engines for the VM_BIND operation, we can have that explicit interface to create engine instances, and the queue_index in vm_bind/unbind will point to those engines. Anyone have any thoughts? Daniel?
Exposing gem_context or intel_context to user space is a strange concept to me. A context represents some hw resources that are used to complete a certain task. User space should just allocate resources (memory, queues) and submit tasks to queues. But user space doesn't care how a certain task is mapped to a HW context - driver/guc should take care of this.

So a cleaner interface to me is: user space creates a vm, creates a gem object, vm_binds it to a vm; allocates queues (internally representing compute or blitter HW; queues can be virtual to the user) for this vm; submits tasks to queues. The user can create multiple queues under one vm. One queue is only for one vm.

The i915 driver/guc manages the hw compute or blitter resources, which is transparent to user space. When i915 or guc decides to schedule a queue (run tasks on that queue), a HW engine will be picked up and set up properly for the vm of that queue (i.e., switch to the page tables of that vm) - this is a context switch.

From the vm_bind perspective, it simply binds a gem_object to a vm. Engine/queue is not a parameter to vm_bind, as any engine can be picked up by i915/guc to execute a task using the vm-bound va.
I didn't completely follow the discussion here. Just share some thoughts.
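Roughly, that flow from userspace might look like the sketch below (using libdrm's drmIoctl). Only the VM create and GEM create calls are existing uapi; the vm_bind ioctl, its struct layout and the queue_idx field are the proposed/assumed pieces here.

    /* Sketch only: create a vm, create an object, bind it, then submit work. */
    struct drm_i915_gem_vm_control vm = {};
    drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm);         /* existing uapi */

    struct drm_i915_gem_create create = { .size = 2 * 1024 * 1024 };
    drmIoctl(fd, DRM_IOCTL_I915_GEM_CREATE, &create);        /* existing uapi */

    struct drm_i915_gem_vm_bind bind = {                     /* proposed uapi (sketch) */
            .vm_id     = vm.vm_id,
            .handle    = create.handle,
            .start     = 0x1000000,
            .offset    = 0,
            .length    = create.size,
            .queue_idx = 0,                                  /* hypothetical per-VM queue */
    };
    drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);         /* proposed ioctl */

    /* ...then submit batches that use the bound VA; any engine picked by
     * i915/GuC can use it, since the binding belongs to the VM, not an engine. */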
Yah, I agree.
Lionel, How about we define the queue as union { __u32 queue_idx; __u64 rsvd; }
If required, we can extend by expanding the 'rsvd' field to <ctx_id, queue_idx> later with a flag.
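As a sketch, that would mean replacing the plain queue_idx member in the vm_bind/unbind structs above with the snippet below; the flag name is made up purely for illustration and this fragment is not standalone code.

    #define I915_VM_BIND_QUEUE_IN_CTX   (1 << 0)   /* hypothetical future flag */

            union {
                    __u32 queue_idx;   /* default: per-VM bind queue index */
                    __u64 rsvd;        /* with a flag like the one above, this
                                        * could later carry a <ctx_id, queue_idx>
                                        * pair instead */
            };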
Niranjana
I did not really understand Oak's comment nor what you're suggesting here to be honest.
First the GEM context is already exposed to userspace. It's explicitly created by userspace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
We give the GEM context id in every execbuffer we do with drm_i915_gem_execbuffer2::rsvd1.
It's still in the new execbuffer3 proposal being discussed.
Second, the GEM context is also where we set the VM with I915_CONTEXT_PARAM_VM.
Third, the GEM context also has the list of engines with I915_CONTEXT_PARAM_ENGINES.
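For reference, those three pieces fit together with the existing uapi roughly as below. This is a simplified sketch with error handling omitted; in practice the VM and the engine map are usually supplied at context creation time via the create extensions rather than SETPARAM.

    /* Simplified sketch of the existing uapi: VM + engines hang off a GEM context. */
    struct drm_i915_gem_vm_control vm = {};
    drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm);

    struct drm_i915_gem_context_create ctx = {};
    drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &ctx);

    struct drm_i915_gem_context_param p = {
            .ctx_id = ctx.ctx_id,
            .param  = I915_CONTEXT_PARAM_VM,
            .value  = vm.vm_id,
    };
    drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);

    I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
            .engines = { { .engine_class = I915_ENGINE_CLASS_RENDER } },
    };
    p.param = I915_CONTEXT_PARAM_ENGINES;
    p.size  = sizeof(engines);
    p.value = (uintptr_t)&engines;
    drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);

    /* ctx.ctx_id then goes into drm_i915_gem_execbuffer2::rsvd1 at submit time. */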
So it makes sense to me to dispatch the vm_bind operation to a GEM context, to a given vm_bind queue, because it's got all the information required :
- the list of new vm_bind queues
- the vm that is going to be modified
Otherwise where do the vm_bind queues live?
In the i915/drm fd object?
That would mean that all the GEM contexts are sharing the same vm_bind queues.
intel_context or GuC are internal details we're not concerned about.
I don't really see the connection with the GEM context.
Maybe Oak has a different use case than Vulkan.
-Lionel
Regards, Oak
Niranjana
I think the interface is clean as an interface to VM. It is only that we don't have a clean way to create a raw VM_BIND engine (not associated with any context) with the i915 uapi. Maybe we can add such an interface, but I don't think that is worth it (we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as I mentioned above). Anyone have any thoughts?
Another problem is, if two VMs are binding with the same defined engine, binding on VM1 can get unnecessarily blocked by binding on VM2 (which may be waiting on its in_fence).
Maybe I'm missing something, but how can you have 2 vm objects with a single gem_context right now?
No, we don't have 2 VMs for a gem_context. Say ctx1 is with vm1 and ctx2 with vm2. The first vm_bind call was for vm1 with q_idx 1 in the ctx1 engine map. The second vm_bind call was for vm2 with q_idx 2 in the ctx2 engine map. If those two queue indices point to the same underlying vm_bind engine, then the second vm_bind call gets blocked until the first vm_bind call's 'in' fence is triggered and the bind completes.

With per-VM queues, this is not a problem as two VMs will not end up sharing the same queue.
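With the sketched per-VM queue field from above (still a hypothetical ioctl and layout), that scenario looks like this: the two binds land on queues owned by different VMs and cannot stall each other, even with the same queue index.

    /* Hypothetical sketch: vm1_id and vm2_id are two previously created VMs. */
    struct drm_i915_gem_vm_bind bind1 = {
            .vm_id = vm1_id, .queue_idx = 0,
            /* in-fence extension not yet signaled: only vm1's queue 0 waits */
    };
    struct drm_i915_gem_vm_bind bind2 = {
            .vm_id = vm2_id, .queue_idx = 0,
            /* same index but a different VM, hence a different queue: not blocked */
    };
    drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind1);
    drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind2);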
BTW, I just posted an updated PATCH series. https://www.spinics.net/lists/dri-devel/msg350483.html
Niranjana
On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
I did not really understand Oak's comment nor what you're suggesting here to be honest.
First the GEM context is already exposed to userspace. It's explicitly created by userspace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
We give the GEM context id in every execbuffer we do with drm_i915_gem_execbuffer2::rsvd1.
It's still in the new execbuffer3 proposal being discussed.
Second, the GEM context is also where we set the VM with I915_CONTEXT_PARAM_VM.
Third, the GEM context also has the list of engines with I915_CONTEXT_PARAM_ENGINES.
Yes, the execbuf and engine map creation are tied to gem_context. (which probably is not the best interface.)
So it makes sense to me to dispatch the vm_bind operation to a GEM context, to a given vm_bind queue, because it's got all the information required :
- the list of new vm_bind queues
- the vm that is going to be modified
But the operation is performed here on the address space (VM) which can have multiple gem_contexts referring to it. So, VM is the right interface here. We need not 'gem_context'ify it.
All we need is multiple queue support for the address space (VM). Going to gem_context for that just because we have engine creation support there seems unnecessary and not correct to me.
Otherwise where do the vm_bind queues live?
In the i915/drm fd object?
That would mean that all the GEM contexts are sharing the same vm_bind queues.
Not all, only the gem contexts that are using the same address space (VM). But to me the right way to describe it would be that "the VM will be using those queues".
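One way to picture "the VM using those queues" on the kernel side is a small per-VM array of bind queues created lazily on first use of a queue index. The sketch below is purely illustrative (made-up type and function names), not the actual i915 implementation.

    /* Illustrative sketch only: per-VM bind queues, each backed by an ordered
     * workqueue so binds on one queue stay in submission order while other
     * queues (and other VMs) proceed independently. */
    #define VM_BIND_MAX_QUEUE   8   /* e.g. the n=3 -> 8 queues mentioned earlier */

    struct sketch_vm_bind_queue {
            struct workqueue_struct *wq;   /* ordered: one bind/unbind at a time */
    };

    struct sketch_address_space {
            struct mutex queue_lock;
            struct sketch_vm_bind_queue queues[VM_BIND_MAX_QUEUE];
    };

    static struct sketch_vm_bind_queue *
    sketch_vm_get_queue(struct sketch_address_space *vm, u32 queue_idx)
    {
            struct sketch_vm_bind_queue *q;

            if (queue_idx >= VM_BIND_MAX_QUEUE)
                    return ERR_PTR(-EINVAL);

            q = &vm->queues[queue_idx];
            mutex_lock(&vm->queue_lock);
            if (!q->wq)   /* created implicitly on first use of this index */
                    q->wq = alloc_ordered_workqueue("vm_bind_q%u", 0, queue_idx);
            mutex_unlock(&vm->queue_lock);

            return q->wq ? q : ERR_PTR(-ENOMEM);
    }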
Niranjana
Thanks, Oak
-----Original Message----- From: Vishwanathapura, Niranjana niranjana.vishwanathapura@intel.com Sent: June 14, 2022 1:02 PM To: Landwerlin, Lionel G lionel.g.landwerlin@intel.com Cc: Zeng, Oak oak.zeng@intel.com; Intel GFX <intel- gfx@lists.freedesktop.org>; Maling list - DRI developers <dri- devel@lists.freedesktop.org>; Hellstrom, Thomas thomas.hellstrom@intel.com; Wilson, Chris P chris.p.wilson@intel.com; Vetter, Daniel daniel.vetter@intel.com; Christian König christian.koenig@amd.com Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
Regards, Oak
-----Original Message----- From: Intel-gfx intel-gfx-bounces@lists.freedesktop.org On Behalf Of Niranjana Vishwanathapura Sent: June 10, 2022 1:43 PM To: Landwerlin, Lionel G lionel.g.landwerlin@intel.com Cc: Intel GFX intel-gfx@lists.freedesktop.org; Maling list - DRI developers <dri- devel@lists.freedesktop.org>; Hellstrom, Thomas thomas.hellstrom@intel.com; Wilson, Chris P chris.p.wilson@intel.com; Vetter, Daniel daniel.vetter@intel.com; Christian König
Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
On 10/06/2022 10:54, Niranjana Vishwanathapura wrote: >On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote: >>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote: >>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote: >>>> On 09/06/2022 00:55, Jason Ekstrand wrote: >>>> >>>> On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura >>>> niranjana.vishwanathapura@intel.com wrote: >>>> >>>> On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko
Ursulin wrote:
>>>> > >>>> > >>>> >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote: >>>> >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana >>>>Vishwanathapura >>>> wrote: >>>> >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason >>>>Ekstrand wrote: >>>> >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana
Vishwanathapura
>>>> >>>> niranjana.vishwanathapura@intel.com wrote: >>>> >>>> >>>> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel >>>>Landwerlin >>>> wrote: >>>> >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote: >>>> >>>> > >>>> >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana >>>>Vishwanathapura >>>> >>>> > niranjana.vishwanathapura@intel.com wrote: >>>> >>>> > >>>> >>>> > On Wed, Jun 01, 2022 at 01:28:36PM
-0700, Matthew
>>>> >>>>Brost wrote: >>>> >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM
+0300, Lionel
On the design document's statement that the VM_BIND/UNBIND ioctl "will immediately start binding/unbinding the mapping in an async worker":

Lionel Landwerlin:

I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved. And the fact that it's happening in an async worker seems to imply it's not immediate.

I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do :

VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)

In what order are the fences going to be signaled? In the order of VM_BIND ioctls? Or out of order? Because you wrote "serialized" I assume it's : in order.

One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification. In Vulkan, VM_BIND operations are serialized but per engine. So you could have something like this :

VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)

fence1 is not signaled, fence3 is signaled, so the second VM_BIND will proceed before the first VM_BIND. I guess we can deal with that scenario in userspace by doing the wait ourselves in one thread per engine, but then it makes the VM_BIND input fences useless. Daniel : what do you think? Should we rework this or just deal with wait fences in userspace?

Niranjana Vishwanathapura:

Ok, will fix the "immediately" wording. It was added because in the earlier design binding was deferred until the next execbuff; now it is non-deferred (immediate in that sense). But yah, this is confusing and I will fix it.

On ordering: yes, the fences are signaled in the order of the VM_BIND/UNBIND ioctls. Note that bind and unbind will use the same queue and hence are ordered.

Matthew Brost:

My opinion is rework this, but make the ordering via an engine param optional. E.g. a VM can be configured so all binds are ordered within the VM, or a VM can be configured so all binds accept an engine argument (in the case of the i915, likely a gem context handle) and binds are ordered with respect to that engine. This gives UMDs options: the latter likely consumes more KMD resources, so a UMD that can live with binds being ordered within the VM can use the mode consuming fewer resources.

Niranjana Vishwanathapura:

I think we need to be careful here if we are looking for some out-of-(submission)-order completion of vm_bind/unbind. In-order completion means that, in a batch of binds and unbinds to be completed in order, the user only needs to specify an in-fence for the first bind/unbind call and the out-fence for the last one. Also, the VA released by an unbind call can be re-used by any subsequent bind call in that in-order batch.

These things will break if binding/unbinding were allowed to go out of (submission) order, and the user would need to be extra careful not to run into premature triggering of the out-fence, binds failing because the VA is still in use, etc.

Also, VM_BIND binds the provided mapping on the specified address space (VM), so the uapi is not engine/context specific. We can however add a 'queue' to the uapi which can be one of the pre-defined queues:

I915_VM_BIND_QUEUE_0
I915_VM_BIND_QUEUE_1
...
I915_VM_BIND_QUEUE_(N-1)

The KMD will spawn an async work queue for each queue, which will only bind the mappings on that queue in the order of submission. The user can assign a queue per engine or anything like that. But again, the user needs to be careful not to deadlock these queues with a circular dependency of fences. I prefer adding this later as an extension, based on whether it really helps the implementation.

Jason Ekstrand:

I can tell you right now that having everything on a single in-order queue will not get us the perf we want. What Vulkan really wants is one of two things:

1. No implicit ordering of VM_BIND ops. They just happen in whatever order their dependencies are resolved and we ensure ordering ourselves by having a syncobj in the VkQueue.
2. The ability to create multiple VM_BIND queues. We need at least 2, but I don't see why there needs to be a limit besides the limits the i915 API already has on the number of engines. Vulkan could expose multiple sparse binding queues to the client if it's not arbitrarily limited.

Niranjana Vishwanathapura:

Thanks Jason, Lionel. Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.

Jason Ekstrand:

There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2. I think someone had an extended version that allowed more but I ripped it out because no one was using it. Of course, execbuffer3 might not have that problem at all.

Niranjana Vishwanathapura:

Thanks Jason. Ok, I am not sure which exec flag that is, but yah, execbuffer3 probably will not have this limitation. So we need to define a VM_BIND_MAX_QUEUE and somehow export it to the user (I am thinking of embedding it in I915_PARAM_HAS_VM_BIND: bits[0] -> HAS_VM_BIND, bits[1-3] -> 'n' meaning 2^n queues).

Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f), which execbuf3 will also have. So we can simply define in the vm_bind/unbind structures:

#define I915_VM_BIND_MAX_QUEUE 64
__u32 queue;

I think that will keep things simple.

Jason Ekstrand:

Yup! That's exactly the limit I was talking about.

Tvrtko Ursulin:

Hmmm? What does the execbuf2 limit have to do with how many engines the hardware can have? I suggest not doing that. The change which added this to context creation:

	if (set.num_engines > I915_EXEC_RING_MASK + 1)
		return -EINVAL;

needs to be undone, so let users create engine maps with all hardware engines, and let execbuf3 access them all.

Niranjana Vishwanathapura:

The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuff3 as well, hence I was using the same limit for VM_BIND queues (64, or 65 if we make it N+1). But, as discussed in another thread of this RFC series, we are planning to drop I915_EXEC_RING_MASK in execbuff3. So there won't be any uapi that limits the number of engines (and hence the number of vm_bind queues that need to be supported).

If we leave the number of vm_bind queues arbitrarily large (__u32 queue_idx), then we need a hashmap for queue lookup (a wq, work_item and a linked list) from the user-specified queue index. The other option is to just put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer this.

Jason Ekstrand:

I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation.

--Jason

Lionel Landwerlin:

I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map? For userspace it's then just a matter of selecting the right queue ID when submitting. If there is ever a possibility to have this work on the GPU, it would be all ready.

Niranjana Vishwanathapura:

I did sync offline with Matt Brost on this. We can add a VM_BIND engine class and let the user create VM_BIND engines (queues). The problem is that in i915 the engine creation interface is bound to gem_context. So, in the vm_bind ioctl, we would need both a context_id and a queue_idx for proper lookup of the user-created engine. This is a bit awkward as vm_bind is an interface to the VM (address space) and has nothing to do with gem_context.

Lionel Landwerlin:

A gem_context has a single vm object, right? Set through I915_CONTEXT_PARAM_VM at creation, or given a default one if not. So it's just like picking up the vm like it's done at execbuffer time right now : eb->context->vm.

Niranjana Vishwanathapura:

Are you suggesting replacing 'vm_id' with 'context_id' in the VM_BIND/UNBIND ioctl, and probably calling it CONTEXT_BIND/UNBIND, because the VM can be obtained from the context?
Yes, because if we go for engines, they're associated with a context and so also associated with the VM bound to the context.
Hmm...context doesn't sound like the right interface. It should be VM and engine (independent of context). The engine can be a virtual or soft engine (a kernel thread), each with its own queue. We can add an interface to create such engines (independent of context), but we are anyway implicitly creating one when the user uses a new queue_idx. If in the future we have hardware engines for the VM_BIND operation, we can have that explicit interface to create engine instances, and the queue_index in vm_bind/unbind will point to those engines. Anyone have any thoughts? Daniel?
Exposing gem_context or intel_context to user space is a strange concept to me. A context represents some HW resources that are used to complete a certain task. User space should care about allocating some resources (memory, queues) and submitting tasks to queues. But user space doesn't care how a certain task is mapped to a HW context - the driver/GuC should take care of this.

So a cleaner interface to me is: user space creates a VM, creates a GEM object and vm_binds it to the VM; allocates queues (internally representing compute or blitter HW; a queue can be virtual to the user) for this VM; and submits tasks to the queues. A user can create multiple queues under one VM. One queue is only for one VM.

The i915 driver/GuC manages the HW compute or blitter resources, which is transparent to user space. When i915 or the GuC decides to schedule a queue (run tasks on that queue), a HW engine will be picked and set up properly for the VM of that queue (i.e., switched to the page tables of that VM) - this is a context switch.

From the vm_bind perspective, it simply binds a gem_object to a VM. Engine/queue is not a parameter to vm_bind, as any engine can be picked by i915/GuC to execute a task using the VM-bound VA.
I didn't completely follow the discussion here. Just share some thoughts.
Yah, I agree.
Lionel, How about we define the queue as union { __u32 queue_idx; __u64 rsvd; }
If required, we can extend by expanding the 'rsvd' field to <ctx_id, queue_idx> later with a flag.
Niranjana
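(Purely as an illustrative sketch of the proposal above -- not the actual i915 uapi; every field other than the queue union is an assumed placeholder for discussion:)

    /* Illustrative sketch only; not the actual i915 uapi. The surrounding
     * fields are assumptions -- the point is just where the proposed union
     * would live: a per-VM bind queue index today, with the reserved space
     * available to grow into a <ctx_id, queue_idx> pair later behind a flag. */
    #include <stdint.h>

    struct sketch_vm_bind {
            uint32_t vm_id;         /* address space (VM) the mapping applies to */
            uint32_t handle;        /* GEM object handle */
            uint64_t start;         /* GPU virtual address of the mapping */
            uint64_t offset;        /* offset into the object */
            uint64_t length;        /* length of the mapping */
            union {
                    uint32_t queue_idx; /* per-VM bind queue, as proposed */
                    uint64_t rsvd;      /* room for <ctx_id, queue_idx> + a flag later */
            };
            uint64_t flags;
            uint64_t extensions;
    };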
I did not really understand Oak's comment nor what you're suggesting here to be honest.
First, the GEM context is already exposed to userspace. It's explicitly created by userspace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
We give the GEM context id in every execbuffer we do with drm_i915_gem_execbuffer2::rsvd1.
It's still in the new execbuffer3 proposal being discussed.
Second, the GEM context is also where we set the VM with I915_CONTEXT_PARAM_VM.
Third, the GEM context also has the list of engines with I915_CONTEXT_PARAM_ENGINES.
Yes, the execbuf and engine map creation are tied to gem_context. (which probably is not the best interface.)
So it makes sense to me to dispatch the vm_bind operation to a GEM context, to a given vm_bind queue, because it's got all the information required :
- the list of new vm_bind queues
- the vm that is going to be modified
But the operation is performed here on the address space (VM) which can have multiple gem_contexts referring to it. So, VM is the right interface here. We need not 'gem_context'ify it.
All we need is multiple queue support for the address space (VM). Going to gem_context for that just because we have engine creation support there seems unnecessary and not correct to me.
Otherwise where do the vm_bind queues live?
In the i915/drm fd object?
That would mean that all the GEM contexts are sharing the same vm_bind queues.
Not all, only the gem contexts that are using the same address space (VM). But to me the right way to describe it would be that "VM will be using those queues".
I hope by "queue" here you mean a HW resource that will be later used to execute the job, for example a ccs compute engine. Of course queue can be virtual so user can create more queues than what hw physically has.
To express the concept of "VM will be using those queues", I think it make sense to have create_queue(vm) function taking a vm parameter. This means this queue is created for the purpose of submit job under this VM. Later on, we can submit job (referring to objects vm_bound to the same vm) to the queue. The vm_bind ioctl doesn’t need to have queue parameter, just vm_bind (object, va, vm).
I hope the "queue" here is not the engine used to perform the vm_bind operation itself. But if you meant a queue/engine to perform vm_bind itself (vs a queue/engine for later job submission), then we can discuss more. I know xe driver have similar concept and I think align the design early can benefit the migration to xe driver.
Regards, Oak
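(For discussion, here is a rough sketch of the create_queue(vm)-style flow described above. None of these functions exist in the i915 uapi; the names and signatures are hypothetical stubs used only to show the ordering of the calls:)

    /* Hypothetical flow only: create a VM, bind objects into it with no
     * queue parameter, create queues against that VM, then submit to them.
     * The stub bodies just print what a driver would be asked to do. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t vm_handle_t;
    typedef uint32_t bo_handle_t;
    typedef uint32_t queue_handle_t;

    static uint32_t next_id = 1;

    static vm_handle_t create_vm(void) { return next_id++; }
    static bo_handle_t create_bo(uint64_t size) { (void)size; return next_id++; }

    static void vm_bind(bo_handle_t bo, uint64_t va, vm_handle_t vm)
    {
            printf("bind bo %u at va 0x%llx in vm %u\n",
                   (unsigned)bo, (unsigned long long)va, (unsigned)vm);
    }

    static queue_handle_t create_queue(vm_handle_t vm)
    {
            printf("create queue for vm %u\n", (unsigned)vm);
            return next_id++;
    }

    static void submit(queue_handle_t q, const char *job)
    {
            printf("submit %s to queue %u\n", job, (unsigned)q);
    }

    int main(void)
    {
            vm_handle_t vm = create_vm();
            bo_handle_t bo = create_bo(4096);

            vm_bind(bo, 0x100000, vm);            /* vm_bind(object, va, vm): no queue argument */

            queue_handle_t q0 = create_queue(vm); /* queues are created against a VM */
            queue_handle_t q1 = create_queue(vm); /* one VM can have several queues */

            submit(q0, "compute job");
            submit(q1, "blit job");
            return 0;
    }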
Niranjana
intel_context or GuC are internal details we're not concerned about.
I don't really see the connection with the GEM context.
Maybe Oak has a different use case than Vulkan.
-Lionel
Regards, Oak
Niranjana
Niranjana Vishwanathapura:

I think the interface is clean as an interface to the VM. It is only that we don't have a clean way to create a raw VM_BIND engine (not associated with any context) with the i915 uapi. Maybe we can add such an interface, but I don't think that is worth it (we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as I mentioned above). Anyone has any thoughts?

Another problem is that if two VMs are binding with the same defined engine, binding on VM1 can get unnecessarily blocked by binding on VM2 (which may be waiting on its in_fence).

Lionel Landwerlin:

Maybe I'm missing something, but how can you have 2 vm objects with a single gem_context right now?

Niranjana Vishwanathapura:

No, we don't have 2 VMs for a gem_context. Say ctx1 is with vm1 and ctx2 with vm2. The first vm_bind call was for vm1 with q_idx 1 in the ctx1 engine map; the second vm_bind call was for vm2 with q_idx 2 in the ctx2 engine map. If those two queue indices point to the same underlying vm_bind engine, then the second vm_bind call gets blocked until the first vm_bind call's 'in' fence is triggered and the bind completes. With per-VM queues, this is not a problem, as two VMs will not end up sharing the same queue.

So, my preference here is to just add a 'u32 queue' index in the vm_bind/unbind ioctl, and the queues are per VM.

BTW, I just posted an updated PATCH series:
https://www.spinics.net/lists/dri-devel/msg350483.html

I am trying to see how many queues we need, and don't want it to be arbitrarily large and unduly blow up memory usage and complexity in the i915 driver.

Jason Ekstrand:

I expect a Vulkan driver to use at most 2 in the vast majority of cases. I could imagine a client wanting to create more than 1 sparse queue, in which case it'll be N+1, but that's unlikely. As far as complexity goes, once you allow two, I don't think the complexity goes up by allowing N. As for memory usage, creating more queues means more memory. That's a trade-off that userspace can make. Again, the expected number here is 1 or 2 in the vast majority of cases, so I don't think you need to worry.

Niranjana Vishwanathapura:

Ok, will start with n=3, meaning 8 queues. That would require us to create 8 workqueues. We can change 'n' later if required.

Jason Ekstrand:

Why? Because Vulkan has two basic kinds of bind operations and we don't want any dependencies between them:

1. Immediate. These happen right after BO creation or maybe as part of vkBindImageMemory() or vkBindBufferMemory(). These don't happen on a queue and we don't want them serialized with anything. To synchronize with submit, we'll have a syncobj in the VkDevice which is signaled by all immediate bind operations and make submits wait on it.
2. Queued (sparse): These happen on a VkQueue which may be the same as a render/compute queue or may be its own queue. It's up to us what we want to advertise. From the Vulkan API PoV, this is like any other queue. Operations on it wait on and signal semaphores. If we have a VM_BIND engine, we'd provide syncobjs to wait on and signal just like we do in execbuf().

The important thing is that we don't want one type of operation to block on the other. If immediate binds are blocking on sparse binds, it's going to cause over-synchronization issues.

In terms of the internal implementation, I know that there's going to be a lock on the VM and that we can't actually do these things in parallel. That's fine. Once the dma_fences have signaled and we're unblocked to do the bind operation, I don't care if there's a bit of synchronization due to locking. That's expected. What we can't afford to have is an immediate bind operation suddenly blocking on a sparse operation which is blocked on a compute job that's going to run for another 5ms.

Niranjana Vishwanathapura:

That's correct. It is like a single VM_BIND engine with multiple queues feeding into it. And as the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND on other VMs. I am not sure about the use cases here, but just wanted to clarify.

Jason Ekstrand:

Right. As long as the queues themselves are independent and can block on dma_fences without holding up other queues, I think we're fine. Yes, that's what I would expect.

For reference, Windows solves this by allowing arbitrarily many paging queues (what they call a VM_BIND engine/queue). That design works pretty well and solves the problems in question. Again, we could just make everything out-of-order and require using syncobjs to order things as userspace wants. That'd be fine too.

One more note while I'm here: danvet said something on IRC about VM_BIND queues waiting for syncobjs to materialize. We don't really want/need this. We already have all the machinery in userspace to handle wait-before-signal and waiting for syncobj fences to materialize, and that machinery is on by default. It would actually take MORE work in Mesa to turn it off and take advantage of the kernel being able to wait for syncobjs to materialize. Also, getting that right is ridiculously hard and I really don't want to get it wrong in kernel space. When we do memory fences, wait-before-signal will be a thing. We don't need to try and make it a thing for syncobj.

--Jason

Lionel Landwerlin:

Thanks Jason. I missed the bit in the Vulkan spec that we're allowed to have a sparse queue that does not implement either graphics or compute operations:

"While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT support in queue families that also include graphics and compute support, other implementations may only expose a VK_QUEUE_SPARSE_BINDING_BIT-only queue family."

So it can all be a vm_bind engine that just does bind/unbind operations. But yes, we need another engine for the immediate/non-sparse operations.

-Lionel

Niranjana Vishwanathapura:

Daniel, any thoughts?
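(On the I915_PARAM_HAS_VM_BIND encoding floated earlier in this thread -- bits[0] indicating support and bits[1-3] carrying 'n' for 2^n bind queues -- here is a sketch of how userspace could decode such a value. The encoding was only a proposal in the discussion, so treat the bit layout as an assumption:)

    /* Decode the proposed (not settled) parameter layout:
     * bits[0]   -> VM_BIND supported
     * bits[1-3] -> n, meaning 2^n bind queues per VM
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static bool has_vm_bind(uint64_t param)
    {
            return param & 0x1;
    }

    static unsigned int vm_bind_queue_count(uint64_t param)
    {
            unsigned int n = (param >> 1) & 0x7;
            return 1u << n;
    }

    int main(void)
    {
            uint64_t param = 0x7; /* example: supported, n = 3 -> 8 queues */

            if (has_vm_bind(param))
                    printf("VM_BIND supported, %u bind queues\n",
                           vm_bind_queue_count(param));
            return 0;
    }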
Thanks, Oak
-----Original Message-----
From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of Zeng, Oak
Sent: June 14, 2022 5:13 PM
To: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Wilson, Chris P <chris.p.wilson@intel.com>; Hellstrom, Thomas <thomas.hellstrom@intel.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian König <christian.koenig@amd.com>
Subject: RE: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

Thanks, Oak

-----Original Message-----
From: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>
Sent: June 14, 2022 1:02 PM
To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
Cc: Zeng, Oak <oak.zeng@intel.com>; Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; Hellstrom, Thomas <thomas.hellstrom@intel.com>; Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel <daniel.vetter@intel.com>; Christian König <christian.koenig@amd.com>
Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
Oops, I read more of this thread and it turns out the vm_bind queue here is actually used to perform the vm bind/unbind operations themselves. The xe driver has a similar concept (except it is called engine_id there), so having a queue_idx parameter is closer to the xe design.

That said, I still feel having a queue_idx parameter to vm_bind is a bit awkward. vm_bind can be performed without any GPU engines, i.e., the CPU itself can complete a vm bind as long as the CPU has access to the GPU's local memory. So the queue here has to be a virtual concept - it doesn't have a hard mapping to a GPU blitter engine.

Can someone summarize what the benefit of the queue_idx parameter is? For the purpose of ordering vm_bind against later GPU jobs?
Regards, Oak
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
And the fact that it's happening in an async worker seem to imply it's not immediate.
I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do :
VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)
In what order are the fences going to be signaled?
In the order of VM_BIND ioctls? Or out of order?
Because you wrote "serialized" I assume it's : in order.
One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification.
In Vulkan VM_BIND operations are serialized but per engine.
So you could have something like this :
VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
Question - let's say this is done after the above operations:
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
Is the exec ordered with respect to bind (i.e. would fence3 & 4 be signaled before the exec starts)?
Matt
fence1 is not signaled
fence3 is signaled
So the second VM_BIND will proceed before the first VM_BIND.
I guess we can deal with that scenario in userspace by doing the wait ourselves in one thread per engines.
But then it makes the VM_BIND input fences useless.
Daniel : what do you think? Should we rework this or just deal with wait fences in userspace?
Sorry I noticed this late.
-Lionel
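(To make the disconnect above concrete, here is a toy model -- not driver code -- of per-engine bind queues where each engine's binds only wait on their own in-fences, so the ccs0 bind can complete while the rcs0 bind is still waiting:)

    /* Toy model (not driver code) of the per-engine serialization described
     * above: each engine has its own in-order bind queue, and a bind on ccs0
     * does not wait for an earlier bind on rcs0. Fence handling is reduced
     * to booleans purely for illustration. */
    #include <stdbool.h>
    #include <stdio.h>

    struct bind { const char *name; bool *in_fence; bool *out_fence; };

    /* Process one engine's queue in submission order and stop at the first
     * bind whose in-fence has not signaled yet. */
    static void process_engine(struct bind *q, int n)
    {
            for (int i = 0; i < n; i++) {
                    if (q[i].in_fence && !*q[i].in_fence) {
                            printf("%s: waiting on its in-fence\n", q[i].name);
                            return; /* later binds on this engine stay queued */
                    }
                    *q[i].out_fence = true;
                    printf("%s: bound, out-fence signaled\n", q[i].name);
            }
    }

    int main(void)
    {
            bool fence1 = false, fence2 = false; /* rcs0 bind: blocked */
            bool fence3 = true,  fence4 = false; /* ccs0 bind: ready  */

            struct bind rcs0[] = { { "VM_BIND(rcs0)", &fence1, &fence2 } };
            struct bind ccs0[] = { { "VM_BIND(ccs0)", &fence3, &fence4 } };

            process_engine(rcs0, 1); /* stays pending: fence1 not signaled */
            process_engine(ccs0, 1); /* proceeds: fence4 signals before fence2 */

            printf("fence2=%d fence4=%d\n", fence2, fence4);
            return 0;
    }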
On 02/06/2022 00:18, Matthew Brost wrote:
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
And the fact that it's happening in an async worker seem to imply it's not immediate.
I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do :
VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)
In what order are the fences going to be signaled?
In the order of VM_BIND ioctls? Or out of order?
Because you wrote "serialized" I assume it's : in order.
One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification.
In Vulkan VM_BIND operations are serialized but per engine.
So you could have something like this :
VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
Question - let's say this is done after the above operations:
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
Is the exec ordered with respect to bind (i.e. would fence3 & 4 be signaled before the exec starts)?
Matt
Hi Matt,
From the vulkan point of view, everything is serialized within an engine (we map that to a VkQueue).
So with :
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL) VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC completes first then VM_BIND executes.
To be even clearer :
EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL) VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC will wait until fence2 is signaled. Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
It would be kind of like having the VM_BIND operation be another batch executed from the ring buffer.
-Lionel
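(A similarly reduced toy model -- again not driver code -- of a single per-engine in-order queue, matching the EXEC-then-VM_BIND ordering described above: the bind cannot run before the EXEC ahead of it, and its out-fence only signals once the bind completes:)

    /* Toy model (not driver code): one in-order queue per engine. Each
     * operation waits for its optional in-fence, runs, then signals its
     * out-fence, so operations on the same engine complete in submission
     * order. */
    #include <stdbool.h>
    #include <stdio.h>

    struct fence { bool signaled; };

    struct op {
            const char   *name;     /* e.g. "EXEC" or "VM_BIND" */
            struct fence *in;       /* optional in-fence (may be NULL) */
            struct fence *out;      /* optional out-fence (may be NULL) */
    };

    /* Run one engine's queue strictly in order. */
    static void run_queue(struct op *ops, int count)
    {
            for (int i = 0; i < count; i++) {
                    if (ops[i].in && !ops[i].in->signaled) {
                            printf("%s waits for its in-fence\n", ops[i].name);
                            /* In a real driver this wait is asynchronous; here
                             * we simply pretend the fence signals now. */
                            ops[i].in->signaled = true;
                    }
                    printf("%s runs\n", ops[i].name);
                    if (ops[i].out)
                            ops[i].out->signaled = true;
            }
    }

    int main(void)
    {
            struct fence fence2 = { false }, fence3 = { false }, fence4 = { false };

            /* ccs0 queue: EXEC first, then VM_BIND, matching the example above.
             * VM_BIND cannot start before the EXEC has finished, and fence4
             * only signals after the bind completes. */
            struct op ccs0[] = {
                    { "EXEC (engine=ccs0, in=fence2)",    &fence2, NULL    },
                    { "VM_BIND (engine=ccs0, in=fence3)", &fence3, &fence4 },
            };

            run_queue(ccs0, 2);
            printf("fence4 signaled: %d\n", fence4.signaled);
            return 0;
    }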
fence1 is not signaled
fence3 is signaled
So the second VM_BIND will proceed before the first VM_BIND.
I guess we can deal with that scenario in userspace by doing the wait ourselves in one thread per engines.
But then it makes the VM_BIND input fences useless.
Daniel : what do you think? Should we rework this or just deal with wait fences in userspace?
Sorry I noticed this late.
-Lionel
On Thu, Jun 02, 2022 at 08:42:13AM +0300, Lionel Landwerlin wrote:
On 02/06/2022 00:18, Matthew Brost wrote:
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
And the fact that it's happening in an async worker seem to imply it's not immediate.
I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do :
VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)
In what order are the fences going to be signaled?
In the order of VM_BIND ioctls? Or out of order?
Because you wrote "serialized" I assume it's : in order.
One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification.
In Vulkan VM_BIND operations are serialized but per engine.
So you could have something like this :
VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
Question - let's say this is done after the above operations:
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
Is the exec ordered with respect to bind (i.e. would fence3 & 4 be signaled before the exec starts)?
Matt
Hi Matt,
From the vulkan point of view, everything is serialized within an engine (we map that to a VkQueue).
So with :
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL) VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC completes first then VM_BIND executes.
To be even clearer :
EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL) VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC will wait until fence2 is signaled. Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
It would be kind of like having the VM_BIND operation be another batch executed from the ring buffer.
Yea this makes sense. I think of VM_BINDs as more or less just another version of an EXEC and this fits with that.
In practice I don't think we can share a ring but we should be able to present an engine (again likely a gem context in i915) to the user that orders VM_BINDs / EXECs if that is what Vulkan expects, at least I think.
Hopefully Niranjana + Daniel agree.
Matt
On Thu, Jun 02, 2022 at 09:22:46AM -0700, Matthew Brost wrote:
On Thu, Jun 02, 2022 at 08:42:13AM +0300, Lionel Landwerlin wrote:
On 02/06/2022 00:18, Matthew Brost wrote:
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
And the fact that it's happening in an async worker seems to imply it's not immediate.
I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do:
VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)
In what order are the fences going to be signaled?
In the order of VM_BIND ioctls? Or out of order?
Because you wrote "serialized", I assume it's: in order.
One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification.
In Vulkan VM_BIND operations are serialized but per engine.
So you could have something like this:
VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
Question - let's say this is done after the above operations:
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
Is the exec ordered with respect to bind (i.e. would fence3 & fence4 be signaled before the exec starts)?
Matt
Hi Matt,
From the Vulkan point of view, everything is serialized within an engine (we map that to a VkQueue).
So with:
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC completes first, then VM_BIND executes.
To be even clearer:
EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC will wait until fence2 is signaled. Once fence2 is signaled, EXEC proceeds, finishes, and only after it is done, VM_BIND executes.
It would be kind of like having the VM_BIND operation be another batch executed from the ring buffer.
Yea this makes sense. I think of VM_BINDs as more or less just another version of an EXEC and this fits with that.
Note that VM_BIND itself can bind while an EXEC (GPU job) is running (say, getting binds ready for the next submission). It is up to the user, though, how to use it.
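As a rough illustration of that pipelining (again, the vm_bind()/exec() wrappers and their fence arguments are only a sketch of the intended usage, not actual uapi):

/* Frame N is already submitted and executing on the GPU. */
exec(engine, batch_frame_n, /* in_fence */ NULL, /* out */ &frame_n_done);

/* While it runs, bind what the next frame will need. */
vm_bind(vm, bo, va, size, /* in_fence */ NULL, /* out */ &bind_done);

/* The next submission is then ordered only after its own bind. */
exec(engine, batch_frame_n1, /* in_fence */ &bind_done, /* out */ &frame_n1_done);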
In practice I don't think we can share a ring, but we should be able to present an engine (again, likely a gem context in i915) to the user that orders VM_BINDs / EXECs, if that is what Vulkan expects, at least I think.
I have responded in the other thread on this.
Niranjana
Hopefully Niranjana + Daniel agree.
Matt
On Thu, Jun 2, 2022 at 7:42 AM Lionel Landwerlin lionel.g.landwerlin@intel.com wrote:
On 02/06/2022 00:18, Matthew Brost wrote:
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
And the fact that it's happening in an async worker seems to imply it's not immediate.
I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do:
VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)
In what order are the fences going to be signaled?
In the order of VM_BIND ioctls? Or out of order?
Because you wrote "serialized", I assume it's: in order.
One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification.
Note that in Vulkan not every queue has to support sparse binding, so one could consider a dedicated sparse-binding-only queue family.
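For reference, that just means a queue family whose queueFlags advertise VK_QUEUE_SPARSE_BINDING_BIT but none of the graphics/compute bits; a purely illustrative way to look for one (phys_dev is assumed to be an already selected VkPhysicalDevice):

#include <vulkan/vulkan.h>

uint32_t count = 0;
vkGetPhysicalDeviceQueueFamilyProperties(phys_dev, &count, NULL);

VkQueueFamilyProperties props[count];
vkGetPhysicalDeviceQueueFamilyProperties(phys_dev, &count, props);

for (uint32_t i = 0; i < count; i++) {
    VkQueueFlags flags = props[i].queueFlags;
    if ((flags & VK_QUEUE_SPARSE_BINDING_BIT) &&
        !(flags & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT))) {
        /* queue family i is a dedicated sparse binding family */
    }
}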
In Vulkan VM_BIND operations are serialized but per engine.
So you could have something like this:
VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
Question - let's say this is done after the above operations:
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
Is the exec ordered with respect to bind (i.e. would fence3 & fence4 be signaled before the exec starts)?
Matt
Hi Matt,
From the Vulkan point of view, everything is serialized within an engine (we map that to a VkQueue).
So with:
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC completes first, then VM_BIND executes.
To be even clearer:
EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC will wait until fence2 is signaled. Once fence2 is signaled, EXEC proceeds, finishes, and only after it is done, VM_BIND executes.
It would be kind of like having the VM_BIND operation be another batch executed from the ring buffer.
-Lionel
fence1 is not signaled
fence3 is signaled
So the second VM_BIND will proceed before the first VM_BIND.
I guess we can deal with that scenario in userspace by doing the wait ourselves in one thread per engine.
But then it makes the VM_BIND input fences useless.
I posed the same question on my series for AMD (https://patchwork.freedesktop.org/series/104578/), albeit for slightly different reasons: if one creates a new VkMemory object, you generally want that mapped ASAP, as you can't track (in a VK_KHR_descriptor_indexing world) whether the next submit is going to use this VkMemory object, and hence have to assume the worst (i.e. wait till the map/bind is complete before executing the next submission). If all binds/unbinds (or maps/unmaps) happen in order, that means an operation with input fences could delay work we want done ASAP.
Of course waiting in userspace does have disadvantages:
1) More overhead between fence signalling and the operation, potentially causing slightly bigger GPU bubbles.
2) You can't get an out fence early. Within the driver we can mostly work around this, but sync_fd exports, WSI and such will be messy.
3) Moving the queue to a thread might make things slightly less ideal due to scheduling delays.
Removing the in-order handling in the kernel generally seems like madness to me, as it is very hard to keep track of the state of the virtual address space (e.g. to track unmapping stuff before freeing memory or moving memory around).
The one game I tried (FH5 over vkd3d-proton) does sparse mapping as follows, on a separate queue:
1) a 0-cmdbuffer submit with 0 input semaphores and 1 output semaphore
2) a sparse bind with the input semaphore from 1) and 1 output semaphore
3) a 0-cmdbuffer submit with the input semaphore from 2) and 1 output fence
4) wait on that fence on the CPU
which works very well if we just wait for the sparse bind input semaphore in userspace, but I'm still working on seeing if this is the common use case or an outlier.
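For reference, that sequence maps roughly onto the following Vulkan calls (heavily abbreviated sketch; the semaphores, the fence, the queue and the sparse buffer bind are assumed to have been created/filled elsewhere):

#include <stdint.h>
#include <vulkan/vulkan.h>

/* 1) 0-cmdbuffer submit that only signals sem1 */
VkSubmitInfo submit1 = {
    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .signalSemaphoreCount = 1,
    .pSignalSemaphores = &sem1,
};
vkQueueSubmit(sparse_queue, 1, &submit1, VK_NULL_HANDLE);

/* 2) sparse bind waiting on sem1, signalling sem2 */
VkBindSparseInfo bind = {
    .sType = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores = &sem1,
    .bufferBindCount = 1,
    .pBufferBinds = &buffer_bind,
    .signalSemaphoreCount = 1,
    .pSignalSemaphores = &sem2,
};
vkQueueBindSparse(sparse_queue, 1, &bind, VK_NULL_HANDLE);

/* 3) 0-cmdbuffer submit waiting on sem2, signalling a fence */
VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;
VkSubmitInfo submit2 = {
    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores = &sem2,
    .pWaitDstStageMask = &wait_stage,
};
vkQueueSubmit(sparse_queue, 1, &submit2, fence);

/* 4) wait on that fence on the CPU */
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);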
Daniel: what do you think? Should we rework this or just deal with wait fences in userspace?
Sorry I noticed this late.
-Lionel
Regards, Oak
-----Original Message----- From: dri-devel dri-devel-bounces@lists.freedesktop.org On Behalf Of Niranjana Vishwanathapura Sent: May 17, 2022 2:32 PM To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter, Daniel daniel.vetter@intel.com Cc: Brost, Matthew matthew.brost@intel.com; Hellstrom, Thomas thomas.hellstrom@intel.com; jason@jlekstrand.net; Wilson, Chris P chris.p.wilson@intel.com; christian.koenig@amd.com Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst new file mode 100644 index 000000000000..f1be560d313c --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.rst @@ -0,0 +1,304 @@ +========================================== +I915 VM_BIND feature design and use cases +========================================== + +VM_BIND feature +================ +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a +specified address space (VM). These mappings (also referred to as persistent +mappings) will be persistent across multiple GPU submissions (execbuff calls) +issued by the UMD, without user having to provide a list of all required +mappings during each submission (as required by older execbuff mode). + +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace +to specify how the binding/unbinding should sync with other operations +like the GPU job submission. These fences will be timeline 'drm_syncobj's +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences). +For Compute contexts, they will be user/memory fences (See struct +drm_i915_vm_bind_ext_user_fence). + +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND. +User has to opt-in for VM_BIND mode of binding for an address space (VM) +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension. + +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
Hi,
Is the user required to wait for the out fence to be signaled before submitting a gpu job using the vm_bind address? Or is the user required to order the gpu job so that it runs after the vm_bind out fence is signaled?
I think there could be different behavior on a non-faultable platform and a faultable platform. For example, on a non-faultable platform the gpu job is required to be ordered after the vm_bind out fence signaling, while on a faultable platform there is no such restriction, since the vm_bind can be finished in the fault handler?
Should we document such a thing?
Regards, Oak
On Wed, Jun 01, 2022 at 07:13:16PM -0700, Zeng, Oak wrote:
Regards, Oak
Hi,
Is the user required to wait for the out fence to be signaled before submitting a gpu job using the vm_bind address? Or is the user required to order the gpu job so that it runs after the vm_bind out fence is signaled?
Thanks Oak. Either should be fine, and it is up to the user how to use the vm_bind/unbind out-fence.
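For illustration, with a drm_syncobj out-fence the two options would look roughly like this (the vm_bind()/exec() wrappers are stand-ins for the RFC uapi; only drmSyncobjWait() is existing libdrm API):

#include <stdint.h>
#include <xf86drm.h>

/* Option 1: wait on the CPU for the bind to complete, then submit. */
vm_bind(fd, vm_id, bo, va, size, /* out syncobj */ bind_done);
drmSyncobjWait(fd, &bind_done, 1, INT64_MAX, 0, NULL);
exec(fd, engine, batch_va, /* in syncobj */ 0);

/* Option 2: no CPU wait, order the job after the bind instead. */
vm_bind(fd, vm_id, bo, va, size, /* out syncobj */ bind_done);
exec(fd, engine, batch_va, /* in syncobj */ bind_done);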
I think there could be different behavior on a non-faultable platform and a faultable platform. For example, on a non-faultable platform the gpu job is required to be ordered after the vm_bind out fence signaling, while on a faultable platform there is no such restriction, since the vm_bind can be finished in the fault handler?
With a GPU page fault handler, the out fence won't be needed, as residency is purely managed by the page fault handler populating the page tables (there is a mention of it in the GPU Page Faults section below).
Should we document such a thing?
We don't talk much about the GPU page faults case in this document, as that may warrant a separate RFC when we add page fault support. We did mention it in a couple of places to ensure our locking design here is extensible to the gpu page faults case.
Niranjana
Regards, Oak
-----Original Message----- From: Vishwanathapura, Niranjana niranjana.vishwanathapura@intel.com Sent: June 2, 2022 4:49 PM To: Zeng, Oak oak.zeng@intel.com Cc: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter, Daniel daniel.vetter@intel.com; Brost, Matthew matthew.brost@intel.com; Hellstrom, Thomas thomas.hellstrom@intel.com; jason@jlekstrand.net; Wilson, Chris P chris.p.wilson@intel.com; christian.koenig@amd.com Subject: Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
Ok, that makes sense to me. Thanks for explaining.
Regards, Oak
Niranjana
Regards, Oak
+VM_BIND features include:
+* Multiple Virtual Address (VA) mappings can map to the same physical
pages
- of an object (aliasing).
+* VA mapping can map to a partial section of the BO (partial binding). +* Support capture of persistent mappings in the dump upon GPU error. +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
- use cases will be helpful.
+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences. +* Support for userptr gem objects (no special uapi is required for this).
+Execbuff ioctl in VM_BIND mode +------------------------------- +The execbuff ioctl handling in VM_BIND mode differs significantly from the +older method. A VM in VM_BIND mode will not support older execbuff
mode of
+binding. In VM_BIND mode, execbuff ioctl will not accept any execlist.
Hence,
+no support for implicit sync. It is expected that the below work will be able +to support requirements of object dependency setting in all use cases:
+"dma-buf: Add an API for exporting sync files" +(https://lwn.net/Articles/859290/)
+This also means, we need an execbuff extension to pass in the batch +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+If at all execlist support in execbuff ioctl is deemed necessary for +implicit sync in certain use cases, then support can be added later.
+In VM_BIND mode, VA allocation is completely managed by the user instead
of
+the i915 driver. Hence all VA assignment, eviction are not applicable in +VM_BIND mode. Also, for determining object activeness, VM_BIND mode
will
not +be using the i915_vma active reference tracking. It will instead use dma-resv +object for that (See `VM_BIND dma_resv usage`_).
+So, a lot of existing code in the execbuff path like relocations, VA evictions, +vma lookup table, implicit sync, vma active reference tracking etc., are not +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned
up
+by clearly separating out the functionalities where the VM_BIND mode
differs
+from older method and they should be moved to separate files.
+VM_PRIVATE objects +------------------- +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO
which
+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped
on
+the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each
execbuff
+submission, they need only one dma-resv fence list updated. Thus, the fast +path (where required mappings are already bound) submission latency is
O(1)
+w.r.t the number of VM private BOs.
+VM_BIND locking hirarchy +------------------------- +The locking design here supports the older (execlist based) execbuff mode,
the
+newer VM_BIND mode, the VM_BIND mode with GPU page faults and
possible
future +system allocator support (See `Shared Virtual Memory (SVM) support`_). +The older execbuff mode and the newer VM_BIND mode without page
faults
manages +residency of backing storage using dma_fence. The VM_BIND mode with
page
faults +and the system allocator support do not use any dma_fence at all.
+VM_BIND locking order is as below.
+1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
- vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing
the
- mapping.
- In future, when GPU page faults are supported, we can potentially use a
- rwsem instead, so that multiple page fault handlers can take the read side
- lock to lookup the mapping and hence can run in parallel.
- The older execbuff mode of binding do not need this lock.
+2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs
to
- be held while binding/unbinding a vma in the async worker and while
updating
- dma-resv fence list of an object. Note that private BOs of a VM will all
- share a dma-resv object.
- The future system allocator support will use the HMM prescribed locking
- instead.
+3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
- invalidated vmas (due to eviction and userptr invalidation) etc.
+When GPU page faults are supported, the execbuff path do not take any of these +locks. There we will simply smash the new batch buffer address into the ring and +then tell the scheduler run that. The lock taking only happens from the page +fault handler, where we take lock-A in read mode, whichever lock-B we
need to
+find the backing storage (dma_resv lock for gem objects, and hmm/core mm
for
+system allocator) and some additional locks (lock-D) for taking care of page +table races. Page fault mode should not need to ever manipulate the vm
lists,
+so won't ever need lock-C.
+VM_BIND LRU handling +--------------------- +We need to ensure VM_BIND mapped objects are properly LRU tagged to
avoid
+performance degradation. We will also need support for bulk LRU movement
of
+VM_BIND objects to avoid additional latencies in execbuff path.
+The page table pages are similar to VM_BIND mapped objects (See +`Evictable page table allocations`_) and are maintained per VM and needs to +be pinned in memory when VM is made active (ie., upon an execbuff call
with
+that VM). So, bulk LRU movement of page table pages is also needed.
+The i915 shrinker LRU has stopped being an LRU. So, it should also be moved +over to the ttm LRU in some fashion to make sure we once again have a reasonable +and consistent memory aging and reclaim architecture.
+VM_BIND dma_resv usage +----------------------- +Fences needs to be added to all VM_BIND mapped objects. During each execbuff +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent +over sync (See enum dma_resv_usage). One can override it with either +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during
object
dependency +setting (either through explicit or implicit mechanism).
+When vm_bind is called for a non-private object while the VM is already +active, the fences need to be copied from VM's shared dma-resv object +(common to all private objects of the VM) to this non-private object. +If this results in performance degradation, then some optimization will +be needed here. This is not a problem for VM's private objects as they use +shared dma-resv object which is always updated on each execbuff
submission.
+Also, in VM_BIND mode, use dma-resv apis for determining object
activeness
+(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not
use
the +older i915_vma active reference tracking which is deprecated. This should be +easier to get it working with the current TTM backend. We can remove the +i915_vma active reference tracking fully while supporting TTM backend for
igfx.
Evictable page table allocations
--------------------------------
Make page table allocations evictable and manage them similarly to VM_BIND
mapped objects. Page table pages are similar to persistent mappings of a VM
(the difference here is that the page table pages will not have an i915_vma
structure and, after swapping pages back in, the parent page link needs to be
updated).

Mesa use case
-------------
VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and
Iris), hence improving performance of CPU-bound applications. It also allows
us to implement Vulkan's Sparse Resources. With increasing GPU hardware
performance, reducing CPU overhead becomes more impactful.


VM_BIND Compute support
=======================
User/Memory Fence
-----------------
The idea is to take a user specified virtual address and install an interrupt
handler to wake up the current task when the memory location passes the user
supplied filter. A User/Memory fence is an <address, value> pair. To signal
the user fence, the specified value will be written at the specified virtual
address and the waiting process will be woken up. The user can wait on a user
fence with the gem_wait_user_fence ioctl. A sketch of these semantics is
shown after this section.

It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
interrupt within their batches after updating the value to have sub-batch
precision on the wakeup. Each batch can signal a user fence to indicate the
completion of the next level batch. The completion of the very first level
batch needs to be signaled by the command streamer. The user must provide the
user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
extension of the execbuff ioctl, so that the KMD can set up the command
streamer to signal it.

A User/Memory fence can also be supplied to the kernel driver to signal/wake
up the user process after completion of an asynchronous operation.

When the VM_BIND ioctl is provided with a user/memory fence via the
I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
completion of binding of that mapping. All async binds/unbinds are
serialized, hence signaling of a user/memory fence also indicates the
completion of all previous binds/unbinds.

This feature will be derived from the below original work:
https://patchwork.freedesktop.org/patch/349417/
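To make the <address, value> semantics above concrete, here is a small
illustrative userspace sketch. It is purely hypothetical (a real UMD would
normally use the proposed gem_wait_user_fence ioctl rather than busy-waiting
or open coding this), and only shows what signaling a user fence means. The
struct and function names are made up for this example.

#include <stdint.h>

/*
 * Illustrative only: a user fence is a qword in user accessible memory plus
 * an expected value. The signaler (a batch, the command streamer or the KMD
 * async worker) writes 'val' to 'addr'; the waiter observes it.
 */
struct user_fence {
        uint64_t *addr;    /* qword aligned virtual address */
        uint64_t val;      /* value that signals the fence */
};

/* Matches the I915_UFENCE_WAIT_EQ condition (proposed later in this series)
 * with a full mask. */
static inline int user_fence_signaled(const struct user_fence *uf)
{
        return __atomic_load_n(uf->addr, __ATOMIC_ACQUIRE) == uf->val;
}

/* What the signaler (GPU or KMD) conceptually does. */
static inline void user_fence_signal(const struct user_fence *uf)
{
        __atomic_store_n(uf->addr, uf->val, __ATOMIC_RELEASE);
}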
Long running Compute contexts
-----------------------------
Usage of dma-fence expects that it completes in a reasonable amount of time.
Compute on the other hand can be long running. Hence it is appropriate for
compute to use user/memory fences, and dma-fence usage will be limited to
in-kernel consumption only. This requires an execbuff uapi extension to pass
in a user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must
opt-in for this mechanism with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
flag during context creation (a sketch is shown after this section). The
dma-fence based user interfaces like the gem_wait ioctl and the execbuff out
fence are not allowed on long running contexts. Implicit sync is not valid
either and is anyway not supported in VM_BIND mode.
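As an illustration, a userspace sketch of the opt-in described above. The
flag value is taken from the RFC header later in this series (not yet in
include/uapi) and error handling is elided.

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Proposed in this RFC series (Documentation/gpu/rfc/i915_vm_bind.h). */
#ifndef I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
#endif

/* Illustrative only: create a context that opts in to long running mode. */
static int create_long_running_context(int drm_fd, uint32_t *ctx_id)
{
        struct drm_i915_gem_context_create_ext create = {
                .flags = I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING,
        };
        int ret;

        ret = ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
        if (ret == 0)
                *ctx_id = create.ctx_id;

        return ret;
}

Such a context would then rely on user/memory fences for completion
signaling; requesting an execbuff out fence on it would be rejected.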
Where GPU page faults are not available, the kernel driver, upon buffer
invalidation, will initiate a suspend (preemption) of the long running
context with a dma-fence attached to it. Upon completion of that suspend
fence, it will finish the invalidation, revalidate the BO and then resume the
compute context. This is done by having a per-context preempt fence (also
called a suspend fence) proxying as the i915_request fence. This suspend
fence is enabled when someone tries to wait on it, which then triggers the
context preemption.

As this support for context suspension using a preempt fence and the resume
work for the compute mode contexts can get tricky to get right, it is better
to add this support in the drm scheduler so that multiple drivers can make
use of it. That means it will have a dependency on the i915 drm scheduler
conversion with the GuC scheduler backend. This should be fine, as the plan
is to support compute mode contexts only with the GuC scheduler backend (at
least initially). This is much easier to support with VM_BIND mode compared
to the current heavier execbuff path resource attachment.
Low Latency Submission
----------------------
Allows the compute UMD to directly submit GPU jobs instead of going through
the execbuff ioctl. This is made possible by VM_BIND not being synchronized
against execbuff. VM_BIND allows bind/unbind of the mappings required for the
directly submitted jobs.


Other VM_BIND use cases
=======================

Debugger
--------
With the debug event interface, a user space process (the debugger) is able
to keep track of and act upon resources created by another process (the
debugged process) and attached to the GPU via the vm_bind interface.

GPU page faults
---------------
GPU page faults, when supported (in the future), will only be supported in
VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode
of binding will require using dma-fence to ensure residency, the GPU page
faults mode, when supported, will not use any dma-fence as residency is
purely managed by installing and removing/invalidating page table entries.
Page level hints settings
-------------------------
VM_BIND allows any hints setting per mapping instead of per BO. Possible
hints include read-only mapping, placement and atomicity. Sub-BO level
placement hints will be even more relevant with upcoming GPU on-demand page
fault support.

Page level Cache/CLOS settings
------------------------------
VM_BIND allows cache/CLOS settings per mapping instead of per BO.

Shared Virtual Memory (SVM) support
-----------------------------------
The VM_BIND interface can be used to map system memory directly (without the
gem BO abstraction) using the HMM interface. SVM is only supported with GPU
page faults enabled.
Broader i915 cleanups
=====================
Supporting this whole new vm_bind mode of binding, which comes with its own
use cases to support and its locking requirements, requires proper
integration with the existing i915 driver. This calls for some broader i915
driver cleanups/simplifications for maintainability of the driver going
forward. Here are a few things that have been identified and are being looked
into.

- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
  feature does not use it and the complexity it brings in is probably more
  than the performance advantage we get in the legacy execbuff case.
- Remove vma->open_count counting.
- Remove i915_vma active reference tracking. The VM_BIND feature will not be
  using it. Instead, use the underlying BO's dma-resv fence list to determine
  whether an i915_vma is active or not.
VM_BIND UAPI
============
.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h

diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..7d10c36b268d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::

     i915_scheduler.rst
+
+.. toctree::
+
+    i915_vm_bind.rst
--
2.21.0.rc0.32.g243a4c7e27
Add some missing i915 uapi documentation which the new i915 VM_BIND feature documentation will refer to.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com --- include/uapi/drm/i915_drm.h | 153 +++++++++++++++++++++++++++--------- 1 file changed, 116 insertions(+), 37 deletions(-)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h index a2def7b27009..8c834a31b56f 100644 --- a/include/uapi/drm/i915_drm.h +++ b/include/uapi/drm/i915_drm.h @@ -751,9 +751,16 @@ typedef struct drm_i915_irq_wait {
/* Must be kept compact -- no holes and well documented */
+/** + * typedef drm_i915_getparam_t - Driver parameter query structure. + */ typedef struct drm_i915_getparam { + /** @param: Driver parameter to query. */ __s32 param; - /* + + /** + * @value: Address of memory where queried value should be put. + * * WARNING: Using pointers instead of fixed-size u64 means we need to write * compat32 code. Don't repeat this mistake. */ @@ -1239,76 +1246,114 @@ struct drm_i915_gem_exec_object2 { __u64 rsvd2; };
+/** + * struct drm_i915_gem_exec_fence - An input or output fence for the execbuff + * ioctl. + * + * The request will wait for input fence to signal before submission. + * + * The returned output fence will be signaled after the completion of the + * request. + */ struct drm_i915_gem_exec_fence { - /** - * User's handle for a drm_syncobj to wait on or signal. - */ + /** @handle: User's handle for a drm_syncobj to wait on or signal. */ __u32 handle;
+ /** + * @flags: Supported flags are, + * + * I915_EXEC_FENCE_WAIT: + * Wait for the input fence before request submission. + * + * I915_EXEC_FENCE_SIGNAL: + * Return request completion fence as output + */ + __u32 flags; #define I915_EXEC_FENCE_WAIT (1<<0) #define I915_EXEC_FENCE_SIGNAL (1<<1) #define __I915_EXEC_FENCE_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_SIGNAL << 1)) - __u32 flags; };
-/* - * See drm_i915_gem_execbuffer_ext_timeline_fences. - */ -#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0 - -/* +/** + * struct drm_i915_gem_execbuffer_ext_timeline_fences - Timeline fences + * for execbuff. + * * This structure describes an array of drm_syncobj and associated points for * timeline variants of drm_syncobj. It is invalid to append this structure to * the execbuf if I915_EXEC_FENCE_ARRAY is set. */ struct drm_i915_gem_execbuffer_ext_timeline_fences { +#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0 + /** @base: Extension link. See struct i915_user_extension. */ struct i915_user_extension base;
/** - * Number of element in the handles_ptr & value_ptr arrays. + * @fence_count: Number of element in the @handles_ptr & @value_ptr + * arrays. */ __u64 fence_count;
/** - * Pointer to an array of struct drm_i915_gem_exec_fence of length - * fence_count. + * @handles_ptr: Pointer to an array of struct drm_i915_gem_exec_fence + * of length @fence_count. */ __u64 handles_ptr;
/** - * Pointer to an array of u64 values of length fence_count. Values - * must be 0 for a binary drm_syncobj. A Value of 0 for a timeline - * drm_syncobj is invalid as it turns a drm_syncobj into a binary one. + * @values_ptr: Pointer to an array of u64 values of length + * @fence_count. + * Values must be 0 for a binary drm_syncobj. A Value of 0 for a + * timeline drm_syncobj is invalid as it turns a drm_syncobj into a + * binary one. */ __u64 values_ptr; };
+/** + * struct drm_i915_gem_execbuffer2 - Structure for execbuff submission + */ struct drm_i915_gem_execbuffer2 { - /** - * List of gem_exec_object2 structs - */ + /** @buffers_ptr: Pointer to a list of gem_exec_object2 structs */ __u64 buffers_ptr; + + /** @buffer_count: Number of elements in @buffers_ptr array */ __u32 buffer_count;
- /** Offset in the batchbuffer to start execution from. */ + /** + * @batch_start_offset: Offset in the batchbuffer to start execution + * from. + */ __u32 batch_start_offset; - /** Bytes used in batchbuffer from batch_start_offset */ + + /** @batch_len: Bytes used in batchbuffer from batch_start_offset */ __u32 batch_len; + + /** @DR1: deprecated */ __u32 DR1; + + /** @DR4: deprecated */ __u32 DR4; + + /** @num_cliprects: See @cliprects_ptr */ __u32 num_cliprects; + /** - * This is a struct drm_clip_rect *cliprects if I915_EXEC_FENCE_ARRAY - * & I915_EXEC_USE_EXTENSIONS are not set. + * @cliprects_ptr: Kernel clipping was a DRI1 misfeature. + * + * It is invalid to use this field if I915_EXEC_FENCE_ARRAY or + * I915_EXEC_USE_EXTENSIONS flags are not set. * * If I915_EXEC_FENCE_ARRAY is set, then this is a pointer to an array - * of struct drm_i915_gem_exec_fence and num_cliprects is the length - * of the array. + * of &drm_i915_gem_exec_fence and @num_cliprects is the length of the + * array. * * If I915_EXEC_USE_EXTENSIONS is set, then this is a pointer to a - * single struct i915_user_extension and num_cliprects is 0. + * single &i915_user_extension and num_cliprects is 0. */ __u64 cliprects_ptr; + + /** @flags: Execbuff flags */ + __u64 flags; #define I915_EXEC_RING_MASK (0x3f) #define I915_EXEC_DEFAULT (0<<0) #define I915_EXEC_RENDER (1<<0) @@ -1326,10 +1371,6 @@ struct drm_i915_gem_execbuffer2 { #define I915_EXEC_CONSTANTS_REL_GENERAL (0<<6) /* default */ #define I915_EXEC_CONSTANTS_ABSOLUTE (1<<6) #define I915_EXEC_CONSTANTS_REL_SURFACE (2<<6) /* gen4/5 only */ - __u64 flags; - __u64 rsvd1; /* now used for context info */ - __u64 rsvd2; -};
/** Resets the SO write offset registers for transform feedback on gen7. */ #define I915_EXEC_GEN7_SOL_RESET (1<<8) @@ -1432,9 +1473,23 @@ struct drm_i915_gem_execbuffer2 { * drm_i915_gem_execbuffer_ext enum. */ #define I915_EXEC_USE_EXTENSIONS (1 << 21) - #define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_USE_EXTENSIONS << 1))
+ /** @rsvd1: Context id */ + __u64 rsvd1; + + /** + * @rsvd2: in and out sync_file file descriptors. + * + * When I915_EXEC_FENCE_IN or I915_EXEC_FENCE_SUBMIT flag is set, the + * lower 32 bits of this field will have the in sync_file fd (input). + * + * When I915_EXEC_FENCE_OUT flag is set, the upper 32 bits of this + * field will have the out sync_file fd (output). + */ + __u64 rsvd2; +}; + #define I915_EXEC_CONTEXT_ID_MASK (0xffffffff) #define i915_execbuffer2_set_context_id(eb2, context) \ (eb2).rsvd1 = context & I915_EXEC_CONTEXT_ID_MASK @@ -1814,13 +1869,32 @@ struct drm_i915_gem_context_create { __u32 pad; };
+/** + * struct drm_i915_gem_context_create_ext - Structure for creating contexts. + */ struct drm_i915_gem_context_create_ext { - __u32 ctx_id; /* output: id of new context*/ + /** @ctx_id: Id of the created context (output) */ + __u32 ctx_id; + + /** + * @flags: Supported flags are, + * + * I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS: + * + * Extensions may be appended to this structure and driver must check + * for those. + * + * I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE + * + * Created context will have single timeline. + */ __u32 flags; #define I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS (1u << 0) #define I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE (1u << 1) #define I915_CONTEXT_CREATE_FLAGS_UNKNOWN \ (-(I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE << 1)) + + /** @extensions: Zero-terminated chain of extensions. */ __u64 extensions; };
@@ -2387,7 +2461,9 @@ struct drm_i915_gem_context_destroy { __u32 pad; };
-/* +/** + * struct drm_i915_gem_vm_control - Structure to create or destroy VM. + * * DRM_I915_GEM_VM_CREATE - * * Create a new virtual memory address space (ppGTT) for use within a context @@ -2397,20 +2473,23 @@ struct drm_i915_gem_context_destroy { * The id of new VM (bound to the fd) for use with I915_CONTEXT_PARAM_VM is * returned in the outparam @id. * - * No flags are defined, with all bits reserved and must be zero. - * * An extension chain maybe provided, starting with @extensions, and terminated * by the @next_extension being 0. Currently, no extensions are defined. * * DRM_I915_GEM_VM_DESTROY - * - * Destroys a previously created VM id, specified in @id. + * Destroys a previously created VM id, specified in @vm_id. * * No extensions or flags are allowed currently, and so must be zero. */ struct drm_i915_gem_vm_control { + /** @extensions: Zero-terminated chain of extensions. */ __u64 extensions; + + /** @flags: reserved for future usage, currently MBZ */ __u32 flags; + + /** @vm_id: Id of the VM created or to be destroyed */ __u32 vm_id; };
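As a concrete illustration of the @rsvd2 documentation added above (not part
of the patch itself), here is a userspace sketch of how the in/out sync_file
fds are packed when the existing I915_EXEC_FENCE_IN/OUT flags are used. The
function name is made up for this example and error handling is minimal.

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/*
 * Illustrative only: submit with an input sync_file fd and get an output
 * fence fd back. The input fd goes into the lower 32 bits of rsvd2 and the
 * output fd comes back in the upper 32 bits.
 */
static int execbuf_with_fence_fds(int drm_fd,
                                  struct drm_i915_gem_execbuffer2 *eb,
                                  int in_fence_fd, int *out_fence_fd)
{
        int ret;

        eb->flags |= I915_EXEC_FENCE_IN | I915_EXEC_FENCE_OUT;
        eb->rsvd2 = (uint32_t)in_fence_fd;

        /* The _WR variant is needed so the kernel can write rsvd2 back. */
        ret = ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2_WR, eb);
        if (ret == 0)
                *out_fence_fd = (int)(eb->rsvd2 >> 32);

        return ret;
}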
On Tue, 17 May 2022 at 19:32, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
Add some missing i915 upai documentation which the new i915 VM_BIND feature documentation will be refer to.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
include/uapi/drm/i915_drm.h | 153 +++++++++++++++++++++++++++--------- 1 file changed, 116 insertions(+), 37 deletions(-)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h index a2def7b27009..8c834a31b56f 100644 --- a/include/uapi/drm/i915_drm.h +++ b/include/uapi/drm/i915_drm.h @@ -751,9 +751,16 @@ typedef struct drm_i915_irq_wait {
/* Must be kept compact -- no holes and well documented */
+/**
- typedef drm_i915_getparam_t - Driver parameter query structure.
This one looks funny in the rendered html for some reason, since it doesn't seem to emit the @param and @value, I guess it doesn't really understand typedef <struct> ?
Maybe make this "struct drm_i915_getparam - Driver parameter query structure." ?
- */
typedef struct drm_i915_getparam {
/** @param: Driver parameter to query. */ __s32 param;
/*
/**
* @value: Address of memory where queried value should be put.
* * WARNING: Using pointers instead of fixed-size u64 means we need to write * compat32 code. Don't repeat this mistake. */
@@ -1239,76 +1246,114 @@ struct drm_i915_gem_exec_object2 { __u64 rsvd2; };
+/**
- struct drm_i915_gem_exec_fence - An input or output fence for the execbuff
s/execbuff/execbuf/, at least that seems to be what we use elsewhere, AFAICT.
- ioctl.
- The request will wait for input fence to signal before submission.
- The returned output fence will be signaled after the completion of the
- request.
- */
struct drm_i915_gem_exec_fence {
/**
* User's handle for a drm_syncobj to wait on or signal.
*/
/** @handle: User's handle for a drm_syncobj to wait on or signal. */ __u32 handle;
/**
* @flags: Supported flags are,
are:
*
* I915_EXEC_FENCE_WAIT:
* Wait for the input fence before request submission.
*
* I915_EXEC_FENCE_SIGNAL:
* Return request completion fence as output
*/
__u32 flags;
#define I915_EXEC_FENCE_WAIT (1<<0) #define I915_EXEC_FENCE_SIGNAL (1<<1) #define __I915_EXEC_FENCE_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_SIGNAL << 1))
__u32 flags;
};
-/*
- See drm_i915_gem_execbuffer_ext_timeline_fences.
- */
-#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
-/* +/**
- struct drm_i915_gem_execbuffer_ext_timeline_fences - Timeline fences
- for execbuff.
*/
- This structure describes an array of drm_syncobj and associated points for
- timeline variants of drm_syncobj. It is invalid to append this structure to
- the execbuf if I915_EXEC_FENCE_ARRAY is set.
struct drm_i915_gem_execbuffer_ext_timeline_fences { +#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
/** @base: Extension link. See struct i915_user_extension. */ struct i915_user_extension base; /**
* Number of element in the handles_ptr & value_ptr arrays.
* @fence_count: Number of element in the @handles_ptr & @value_ptr
s/element/elements/
* arrays. */ __u64 fence_count; /**
* Pointer to an array of struct drm_i915_gem_exec_fence of length
* fence_count.
* @handles_ptr: Pointer to an array of struct drm_i915_gem_exec_fence
* of length @fence_count. */ __u64 handles_ptr; /**
* Pointer to an array of u64 values of length fence_count. Values
* must be 0 for a binary drm_syncobj. A Value of 0 for a timeline
* drm_syncobj is invalid as it turns a drm_syncobj into a binary one.
* @values_ptr: Pointer to an array of u64 values of length
* @fence_count.
* Values must be 0 for a binary drm_syncobj. A Value of 0 for a
* timeline drm_syncobj is invalid as it turns a drm_syncobj into a
* binary one. */ __u64 values_ptr;
};
+/**
- struct drm_i915_gem_execbuffer2 - Structure for execbuff submission
- */
struct drm_i915_gem_execbuffer2 {
/**
* List of gem_exec_object2 structs
*/
/** @buffers_ptr: Pointer to a list of gem_exec_object2 structs */ __u64 buffers_ptr;
/** @buffer_count: Number of elements in @buffers_ptr array */ __u32 buffer_count;
/** Offset in the batchbuffer to start execution from. */
/**
* @batch_start_offset: Offset in the batchbuffer to start execution
* from.
*/ __u32 batch_start_offset;
/** Bytes used in batchbuffer from batch_start_offset */
/** @batch_len: Bytes used in batchbuffer from batch_start_offset */
"Length in bytes of the batchbuffer, otherwise assumed to be the object size if zero, starting from the @batch_start_offset."
__u32 batch_len;
/** @DR1: deprecated */ __u32 DR1;
/** @DR4: deprecated */ __u32 DR4;
/** @num_cliprects: See @cliprects_ptr */ __u32 num_cliprects;
/**
* This is a struct drm_clip_rect *cliprects if I915_EXEC_FENCE_ARRAY
* & I915_EXEC_USE_EXTENSIONS are not set.
* @cliprects_ptr: Kernel clipping was a DRI1 misfeature.
*
* It is invalid to use this field if I915_EXEC_FENCE_ARRAY or
* I915_EXEC_USE_EXTENSIONS flags are not set. * * If I915_EXEC_FENCE_ARRAY is set, then this is a pointer to an array
* of struct drm_i915_gem_exec_fence and num_cliprects is the length
* of the array.
* of &drm_i915_gem_exec_fence and @num_cliprects is the length of the
* array. * * If I915_EXEC_USE_EXTENSIONS is set, then this is a pointer to a
* single struct i915_user_extension and num_cliprects is 0.
* single &i915_user_extension and num_cliprects is 0. */ __u64 cliprects_ptr;
/** @flags: Execbuff flags */
s/Execbuff/Execbuf/
Could maybe document the I915_EXEC_* also, or maybe not ;)
__u64 flags;
#define I915_EXEC_RING_MASK (0x3f) #define I915_EXEC_DEFAULT (0<<0) #define I915_EXEC_RENDER (1<<0) @@ -1326,10 +1371,6 @@ struct drm_i915_gem_execbuffer2 { #define I915_EXEC_CONSTANTS_REL_GENERAL (0<<6) /* default */ #define I915_EXEC_CONSTANTS_ABSOLUTE (1<<6) #define I915_EXEC_CONSTANTS_REL_SURFACE (2<<6) /* gen4/5 only */
__u64 flags;
__u64 rsvd1; /* now used for context info */
__u64 rsvd2;
-};
/** Resets the SO write offset registers for transform feedback on gen7. */ #define I915_EXEC_GEN7_SOL_RESET (1<<8) @@ -1432,9 +1473,23 @@ struct drm_i915_gem_execbuffer2 {
- drm_i915_gem_execbuffer_ext enum.
*/ #define I915_EXEC_USE_EXTENSIONS (1 << 21)
#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_USE_EXTENSIONS << 1))
/** @rsvd1: Context id */
__u64 rsvd1;
/**
* @rsvd2: in and out sync_file file descriptors.
*
* When I915_EXEC_FENCE_IN or I915_EXEC_FENCE_SUBMIT flag is set, the
* lower 32 bits of this field will have the in sync_file fd (input).
*
* When I915_EXEC_FENCE_OUT flag is set, the upper 32 bits of this
* field will have the out sync_file fd (output).
*/
__u64 rsvd2;
+};
#define I915_EXEC_CONTEXT_ID_MASK (0xffffffff) #define i915_execbuffer2_set_context_id(eb2, context) \ (eb2).rsvd1 = context & I915_EXEC_CONTEXT_ID_MASK @@ -1814,13 +1869,32 @@ struct drm_i915_gem_context_create { __u32 pad; };
+/**
- struct drm_i915_gem_context_create_ext - Structure for creating contexts.
- */
struct drm_i915_gem_context_create_ext {
__u32 ctx_id; /* output: id of new context*/
/** @ctx_id: Id of the created context (output) */
__u32 ctx_id;
/**
* @flags: Supported flags are,
are:
*
* I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS:
*
* Extensions may be appended to this structure and driver must check
* for those.
Maybe add "See @extensions.", and then....
*
* I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE
*
* Created context will have single timeline.
*/ __u32 flags;
#define I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS (1u << 0) #define I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE (1u << 1) #define I915_CONTEXT_CREATE_FLAGS_UNKNOWN \ (-(I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE << 1))
/** @extensions: Zero-terminated chain of extensions. */
...here perhaps list the extensions, and maybe also move the #define for each here? See for example @extensions in drm_i915_gem_create_ext.
Reviewed-by: Matthew Auld matthew.auld@intel.com
__u64 extensions;
};
@@ -2387,7 +2461,9 @@ struct drm_i915_gem_context_destroy { __u32 pad; };
-/* +/**
- struct drm_i915_gem_vm_control - Structure to create or destroy VM.
- DRM_I915_GEM_VM_CREATE -
- Create a new virtual memory address space (ppGTT) for use within a context
@@ -2397,20 +2473,23 @@ struct drm_i915_gem_context_destroy {
- The id of new VM (bound to the fd) for use with I915_CONTEXT_PARAM_VM is
- returned in the outparam @id.
- No flags are defined, with all bits reserved and must be zero.
- An extension chain maybe provided, starting with @extensions, and terminated
- by the @next_extension being 0. Currently, no extensions are defined.
- DRM_I915_GEM_VM_DESTROY -
- Destroys a previously created VM id, specified in @id.
*/
- Destroys a previously created VM id, specified in @vm_id.
- No extensions or flags are allowed currently, and so must be zero.
struct drm_i915_gem_vm_control {
/** @extensions: Zero-terminated chain of extensions. */ __u64 extensions;
/** @flags: reserved for future usage, currently MBZ */ __u32 flags;
/** @vm_id: Id of the VM created or to be destroyed */ __u32 vm_id;
};
-- 2.21.0.rc0.32.g243a4c7e27
On Wed, Jun 08, 2022 at 12:24:04PM +0100, Matthew Auld wrote:
On Tue, 17 May 2022 at 19:32, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
Add some missing i915 upai documentation which the new i915 VM_BIND feature documentation will be refer to.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
include/uapi/drm/i915_drm.h | 153 +++++++++++++++++++++++++++--------- 1 file changed, 116 insertions(+), 37 deletions(-)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h index a2def7b27009..8c834a31b56f 100644 --- a/include/uapi/drm/i915_drm.h +++ b/include/uapi/drm/i915_drm.h @@ -751,9 +751,16 @@ typedef struct drm_i915_irq_wait {
/* Must be kept compact -- no holes and well documented */
+/**
- typedef drm_i915_getparam_t - Driver parameter query structure.
This one looks funny in the rendered html for some reason, since it doesn't seem to emit the @param and @value, I guess it doesn't really understand typedef <struct> ?
Maybe make this "struct drm_i915_getparam - Driver parameter query structure." ?
Thanks Matt. Yah, there doesn't seem to be a good way to add kernel-doc for this kind of declaration. 'struct drm_i915_getparam' also didn't help. I was able to fix it by first defining the structure and then adding a typedef for it. Not sure if that has any value, but at least we can get kernel-doc for it.
- */
typedef struct drm_i915_getparam {
/** @param: Driver parameter to query. */ __s32 param;
/*
/**
* @value: Address of memory where queried value should be put.
* * WARNING: Using pointers instead of fixed-size u64 means we need to write * compat32 code. Don't repeat this mistake. */
@@ -1239,76 +1246,114 @@ struct drm_i915_gem_exec_object2 { __u64 rsvd2; };
+/**
- struct drm_i915_gem_exec_fence - An input or output fence for the execbuff
s/execbuff/execbuf/, at least that seems to be what we use elsewhere, AFAICT.
- ioctl.
- The request will wait for input fence to signal before submission.
- The returned output fence will be signaled after the completion of the
- request.
- */
struct drm_i915_gem_exec_fence {
/**
* User's handle for a drm_syncobj to wait on or signal.
*/
/** @handle: User's handle for a drm_syncobj to wait on or signal. */ __u32 handle;
/**
* @flags: Supported flags are,
are:
*
* I915_EXEC_FENCE_WAIT:
* Wait for the input fence before request submission.
*
* I915_EXEC_FENCE_SIGNAL:
* Return request completion fence as output
*/
__u32 flags;
#define I915_EXEC_FENCE_WAIT (1<<0) #define I915_EXEC_FENCE_SIGNAL (1<<1) #define __I915_EXEC_FENCE_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_SIGNAL << 1))
__u32 flags;
};
-/*
- See drm_i915_gem_execbuffer_ext_timeline_fences.
- */
-#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
-/* +/**
- struct drm_i915_gem_execbuffer_ext_timeline_fences - Timeline fences
- for execbuff.
*/
- This structure describes an array of drm_syncobj and associated points for
- timeline variants of drm_syncobj. It is invalid to append this structure to
- the execbuf if I915_EXEC_FENCE_ARRAY is set.
struct drm_i915_gem_execbuffer_ext_timeline_fences { +#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
/** @base: Extension link. See struct i915_user_extension. */ struct i915_user_extension base; /**
* Number of element in the handles_ptr & value_ptr arrays.
* @fence_count: Number of element in the @handles_ptr & @value_ptr
s/element/elements/
* arrays. */ __u64 fence_count; /**
* Pointer to an array of struct drm_i915_gem_exec_fence of length
* fence_count.
* @handles_ptr: Pointer to an array of struct drm_i915_gem_exec_fence
* of length @fence_count. */ __u64 handles_ptr; /**
* Pointer to an array of u64 values of length fence_count. Values
* must be 0 for a binary drm_syncobj. A Value of 0 for a timeline
* drm_syncobj is invalid as it turns a drm_syncobj into a binary one.
* @values_ptr: Pointer to an array of u64 values of length
* @fence_count.
* Values must be 0 for a binary drm_syncobj. A Value of 0 for a
* timeline drm_syncobj is invalid as it turns a drm_syncobj into a
* binary one. */ __u64 values_ptr;
};
+/**
- struct drm_i915_gem_execbuffer2 - Structure for execbuff submission
- */
struct drm_i915_gem_execbuffer2 {
/**
* List of gem_exec_object2 structs
*/
/** @buffers_ptr: Pointer to a list of gem_exec_object2 structs */ __u64 buffers_ptr;
/** @buffer_count: Number of elements in @buffers_ptr array */ __u32 buffer_count;
/** Offset in the batchbuffer to start execution from. */
/**
* @batch_start_offset: Offset in the batchbuffer to start execution
* from.
*/ __u32 batch_start_offset;
/** Bytes used in batchbuffer from batch_start_offset */
/** @batch_len: Bytes used in batchbuffer from batch_start_offset */
"Length in bytes of the batchbuffer, otherwise assumed to be the object size if zero, starting from the @batch_start_offset."
__u32 batch_len;
/** @DR1: deprecated */ __u32 DR1;
/** @DR4: deprecated */ __u32 DR4;
/** @num_cliprects: See @cliprects_ptr */ __u32 num_cliprects;
/**
* This is a struct drm_clip_rect *cliprects if I915_EXEC_FENCE_ARRAY
* & I915_EXEC_USE_EXTENSIONS are not set.
* @cliprects_ptr: Kernel clipping was a DRI1 misfeature.
*
* It is invalid to use this field if I915_EXEC_FENCE_ARRAY or
* I915_EXEC_USE_EXTENSIONS flags are not set. * * If I915_EXEC_FENCE_ARRAY is set, then this is a pointer to an array
* of struct drm_i915_gem_exec_fence and num_cliprects is the length
* of the array.
* of &drm_i915_gem_exec_fence and @num_cliprects is the length of the
* array. * * If I915_EXEC_USE_EXTENSIONS is set, then this is a pointer to a
* single struct i915_user_extension and num_cliprects is 0.
* single &i915_user_extension and num_cliprects is 0. */ __u64 cliprects_ptr;
/** @flags: Execbuff flags */
s/Execbuff/Execbuf/
Could maybe document the I915_EXEC_* also, or maybe not ;)
We no longer need to refer to execbuf2 as vm_bind will have its own new execbuf3. But will keep the already added execbuf2 documentation.
__u64 flags;
#define I915_EXEC_RING_MASK (0x3f) #define I915_EXEC_DEFAULT (0<<0) #define I915_EXEC_RENDER (1<<0) @@ -1326,10 +1371,6 @@ struct drm_i915_gem_execbuffer2 { #define I915_EXEC_CONSTANTS_REL_GENERAL (0<<6) /* default */ #define I915_EXEC_CONSTANTS_ABSOLUTE (1<<6) #define I915_EXEC_CONSTANTS_REL_SURFACE (2<<6) /* gen4/5 only */
__u64 flags;
__u64 rsvd1; /* now used for context info */
__u64 rsvd2;
-};
/** Resets the SO write offset registers for transform feedback on gen7. */ #define I915_EXEC_GEN7_SOL_RESET (1<<8) @@ -1432,9 +1473,23 @@ struct drm_i915_gem_execbuffer2 {
- drm_i915_gem_execbuffer_ext enum.
*/ #define I915_EXEC_USE_EXTENSIONS (1 << 21)
#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_USE_EXTENSIONS << 1))
/** @rsvd1: Context id */
__u64 rsvd1;
/**
* @rsvd2: in and out sync_file file descriptors.
*
* When I915_EXEC_FENCE_IN or I915_EXEC_FENCE_SUBMIT flag is set, the
* lower 32 bits of this field will have the in sync_file fd (input).
*
* When I915_EXEC_FENCE_OUT flag is set, the upper 32 bits of this
* field will have the out sync_file fd (output).
*/
__u64 rsvd2;
+};
#define I915_EXEC_CONTEXT_ID_MASK (0xffffffff) #define i915_execbuffer2_set_context_id(eb2, context) \ (eb2).rsvd1 = context & I915_EXEC_CONTEXT_ID_MASK @@ -1814,13 +1869,32 @@ struct drm_i915_gem_context_create { __u32 pad; };
+/**
- struct drm_i915_gem_context_create_ext - Structure for creating contexts.
- */
struct drm_i915_gem_context_create_ext {
__u32 ctx_id; /* output: id of new context*/
/** @ctx_id: Id of the created context (output) */
__u32 ctx_id;
/**
* @flags: Supported flags are,
are:
*
* I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS:
*
* Extensions may be appended to this structure and driver must check
* for those.
Maybe add "See @extensions.", and then....
*
* I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE
*
* Created context will have single timeline.
*/ __u32 flags;
#define I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS (1u << 0) #define I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE (1u << 1) #define I915_CONTEXT_CREATE_FLAGS_UNKNOWN \ (-(I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE << 1))
/** @extensions: Zero-terminated chain of extensions. */
...here perhaps list the extensions, and maybe also move the #define for each here? See for example @extensions in drm_i915_gem_create_ext.
Ok, will address all your comments above.
Niranjana
Reviewed-by: Matthew Auld matthew.auld@intel.com
__u64 extensions;
};
@@ -2387,7 +2461,9 @@ struct drm_i915_gem_context_destroy { __u32 pad; };
-/* +/**
- struct drm_i915_gem_vm_control - Structure to create or destroy VM.
- DRM_I915_GEM_VM_CREATE -
- Create a new virtual memory address space (ppGTT) for use within a context
@@ -2397,20 +2473,23 @@ struct drm_i915_gem_context_destroy {
- The id of new VM (bound to the fd) for use with I915_CONTEXT_PARAM_VM is
- returned in the outparam @id.
- No flags are defined, with all bits reserved and must be zero.
- An extension chain maybe provided, starting with @extensions, and terminated
- by the @next_extension being 0. Currently, no extensions are defined.
- DRM_I915_GEM_VM_DESTROY -
- Destroys a previously created VM id, specified in @id.
*/
- Destroys a previously created VM id, specified in @vm_id.
- No extensions or flags are allowed currently, and so must be zero.
struct drm_i915_gem_vm_control {
/** @extensions: Zero-terminated chain of extensions. */ __u64 extensions;
/** @flags: reserved for future usage, currently MBZ */ __u32 flags;
/** @vm_id: Id of the VM created or to be destroyed */ __u32 vm_id;
};
-- 2.21.0.rc0.32.g243a4c7e27
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com --- Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/* + * Copyright © 2022 Intel Corporation + */ + +/** + * DOC: I915_PARAM_HAS_VM_BIND + * + * VM_BIND feature availability. + * See typedef drm_i915_getparam_t param. + */ +#define I915_PARAM_HAS_VM_BIND 57 + +/** + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND + * + * Flag to opt-in for VM_BIND mode of binding during VM creation. + * See struct drm_i915_gem_vm_control flags. + * + * A VM in VM_BIND mode will not support the older execbuff mode of binding. + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the + * &drm_i915_gem_execbuffer2.buffer_count must be 0). + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and + * &drm_i915_gem_execbuffer2.batch_len must be 0. + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided + * to pass in the batch buffer addresses. + * + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0 + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses). + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0. + */ +#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0) + +/** + * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING + * + * Flag to declare context as long running. + * See struct drm_i915_gem_context_create_ext flags. + * + * Usage of dma-fence expects that they complete in reasonable amount of time. + * Compute on the other hand can be long running. Hence it is not appropriate + * for compute contexts to export request completion dma-fence to user. + * The dma-fence usage will be limited to in-kernel consumption only. + * Compute contexts need to use user/memory fence. + * + * So, long running contexts do not support output fences. Hence, + * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and + * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected + * to be not used. + * + * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped + * to long running contexts. + */ +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2) + +/* VM_BIND related ioctls */ +#define DRM_I915_GEM_VM_BIND 0x3d +#define DRM_I915_GEM_VM_UNBIND 0x3e +#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f + +#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence) + +/** + * struct drm_i915_gem_vm_bind - VA to object mapping to bind. + * + * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU + * virtual address (VA) range to the section of an object that should be bound + * in the device page table of the specified address space (VM). + * The VA range specified must be unique (ie., not currently bound) and can + * be mapped to whole object or a section of the object (partial binding). + * Multiple VA mappings can be created to the same section of the object + * (aliasing). 
+ */ +struct drm_i915_gem_vm_bind { + /** @vm_id: VM (address space) id to bind */ + __u32 vm_id; + + /** @handle: Object handle */ + __u32 handle; + + /** @start: Virtual Address start to bind */ + __u64 start; + + /** @offset: Offset in object to bind */ + __u64 offset; + + /** @length: Length of mapping to bind */ + __u64 length; + + /** + * @flags: Supported flags are, + * + * I915_GEM_VM_BIND_READONLY: + * Mapping is read-only. + * + * I915_GEM_VM_BIND_CAPTURE: + * Capture this mapping in the dump upon GPU error. + */ + __u64 flags; +#define I915_GEM_VM_BIND_READONLY (1 << 0) +#define I915_GEM_VM_BIND_CAPTURE (1 << 1) + + /** @extensions: 0-terminated chain of extensions for this mapping. */ + __u64 extensions; +}; + +/** + * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind. + * + * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual + * address (VA) range that should be unbound from the device page table of the + * specified address space (VM). The specified VA range must match one of the + * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind + * completion. + */ +struct drm_i915_gem_vm_unbind { + /** @vm_id: VM (address space) id to bind */ + __u32 vm_id; + + /** @rsvd: Reserved for future use; must be zero. */ + __u32 rsvd; + + /** @start: Virtual Address start to unbind */ + __u64 start; + + /** @length: Length of mapping to unbind */ + __u64 length; + + /** @flags: reserved for future usage, currently MBZ */ + __u64 flags; + + /** @extensions: 0-terminated chain of extensions for this mapping. */ + __u64 extensions; +}; + +/** + * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind + * or the vm_unbind work. + * + * The vm_bind or vm_unbind aync worker will wait for input fence to signal + * before starting the binding or unbinding. + * + * The vm_bind or vm_unbind async worker will signal the returned output fence + * after the completion of binding or unbinding. + */ +struct drm_i915_vm_bind_fence { + /** @handle: User's handle for a drm_syncobj to wait on or signal. */ + __u32 handle; + + /** + * @flags: Supported flags are, + * + * I915_VM_BIND_FENCE_WAIT: + * Wait for the input fence before binding/unbinding + * + * I915_VM_BIND_FENCE_SIGNAL: + * Return bind/unbind completion fence as output + */ + __u32 flags; +#define I915_VM_BIND_FENCE_WAIT (1<<0) +#define I915_VM_BIND_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1)) +}; + +/** + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind + * and vm_unbind. + * + * This structure describes an array of timeline drm_syncobj and associated + * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's + * can be input or output fences (See struct drm_i915_vm_bind_fence). + */ +struct drm_i915_vm_bind_ext_timeline_fences { +#define I915_VM_BIND_EXT_timeline_FENCES 0 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** + * @fence_count: Number of elements in the @handles_ptr & @value_ptr + * arrays. + */ + __u64 fence_count; + + /** + * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence + * of length @fence_count. + */ + __u64 handles_ptr; + + /** + * @values_ptr: Pointer to an array of u64 values of length + * @fence_count. + * Values must be 0 for a binary drm_syncobj. A Value of 0 for a + * timeline drm_syncobj is invalid as it turns a drm_syncobj into a + * binary one. 
+ */ + __u64 values_ptr; +}; + +/** + * struct drm_i915_vm_bind_user_fence - An input or output user fence for the + * vm_bind or the vm_unbind work. + * + * The vm_bind or vm_unbind aync worker will wait for the input fence (value at + * @addr to become equal to @val) before starting the binding or unbinding. + * + * The vm_bind or vm_unbind async worker will signal the output fence after + * the completion of binding or unbinding by writing @val to memory location at + * @addr + */ +struct drm_i915_vm_bind_user_fence { + /** @addr: User/Memory fence qword aligned process virtual address */ + __u64 addr; + + /** @val: User/Memory fence value to be written after bind completion */ + __u64 val; + + /** + * @flags: Supported flags are, + * + * I915_VM_BIND_USER_FENCE_WAIT: + * Wait for the input fence before binding/unbinding + * + * I915_VM_BIND_USER_FENCE_SIGNAL: + * Return bind/unbind completion fence as output + */ + __u32 flags; +#define I915_VM_BIND_USER_FENCE_WAIT (1<<0) +#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \ + (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1)) +}; + +/** + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind + * and vm_unbind. + * + * These user fences can be input or output fences + * (See struct drm_i915_vm_bind_user_fence). + */ +struct drm_i915_vm_bind_ext_user_fence { +#define I915_VM_BIND_EXT_USER_FENCES 1 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** @fence_count: Number of elements in the @user_fence_ptr array. */ + __u64 fence_count; + + /** + * @user_fence_ptr: Pointer to an array of + * struct drm_i915_vm_bind_user_fence of length @fence_count. + */ + __u64 user_fence_ptr; +}; + +/** + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer + * gpu virtual addresses. + * + * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension + * must always be appended in the VM_BIND mode and it will be an error to + * append this extension in older non-VM_BIND mode. + */ +struct drm_i915_gem_execbuffer_ext_batch_addresses { +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** @count: Number of addresses in the addr array. */ + __u32 count; + + /** @addr: An array of batch gpu virtual addresses. */ + __u64 addr[0]; +}; + +/** + * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion + * signaling extension. + * + * This extension allows user to attach a user fence (@addr, @value pair) to an + * execbuf to be signaled by the command streamer after the completion of first + * level batch, by writing the @value at specified @addr and triggering an + * interrupt. + * User can either poll for this user fence to signal or can also wait on it + * with i915_gem_wait_user_fence ioctl. + * This is very much usefaul for long running contexts where waiting on dma-fence + * by user (like i915_gem_wait ioctl) is not supported. + */ +struct drm_i915_gem_execbuffer_ext_user_fence { +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** + * @addr: User/Memory fence qword aligned GPU virtual address. + * + * Address has to be a valid GPU virtual address at the time of + * first level batch completion. 
+ */ + __u64 addr; + + /** + * @value: User/Memory fence Value to be written to above address + * after first level batch completes. + */ + __u64 value; + + /** @rsvd: Reserved for future extensions, MBZ */ + __u64 rsvd; +}; + +/** + * struct drm_i915_gem_create_ext_vm_private - Extension to make the object + * private to the specified VM. + * + * See struct drm_i915_gem_create_ext. + */ +struct drm_i915_gem_create_ext_vm_private { +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** @vm_id: Id of the VM to which the object is private */ + __u32 vm_id; +}; + +/** + * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence. + * + * User/Memory fence can be woken up either by: + * + * 1. GPU context indicated by @ctx_id, or, + * 2. Kerrnel driver async worker upon I915_UFENCE_WAIT_SOFT. + * @ctx_id is ignored when this flag is set. + * + * Wakeup condition is, + * ``((*addr & mask) op (value & mask))`` + * + * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>` + */ +struct drm_i915_gem_wait_user_fence { + /** @extensions: Zero-terminated chain of extensions. */ + __u64 extensions; + + /** @addr: User/Memory fence address */ + __u64 addr; + + /** @ctx_id: Id of the Context which will signal the fence. */ + __u32 ctx_id; + + /** @op: Wakeup condition operator */ + __u16 op; +#define I915_UFENCE_WAIT_EQ 0 +#define I915_UFENCE_WAIT_NEQ 1 +#define I915_UFENCE_WAIT_GT 2 +#define I915_UFENCE_WAIT_GTE 3 +#define I915_UFENCE_WAIT_LT 4 +#define I915_UFENCE_WAIT_LTE 5 +#define I915_UFENCE_WAIT_BEFORE 6 +#define I915_UFENCE_WAIT_AFTER 7 + + /** + * @flags: Supported flags are, + * + * I915_UFENCE_WAIT_SOFT: + * + * To be woken up by i915 driver async worker (not by GPU). + * + * I915_UFENCE_WAIT_ABSTIME: + * + * Wait timeout specified as absolute time. + */ + __u16 flags; +#define I915_UFENCE_WAIT_SOFT 0x1 +#define I915_UFENCE_WAIT_ABSTIME 0x2 + + /** @value: Wakeup value */ + __u64 value; + + /** @mask: Wakeup mask */ + __u64 mask; +#define I915_UFENCE_WAIT_U8 0xffu +#define I915_UFENCE_WAIT_U16 0xffffu +#define I915_UFENCE_WAIT_U32 0xfffffffful +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull + + /** + * @timeout: Wait timeout in nanoseconds. + * + * If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is the + * absolute time in nsec. + */ + __s64 timeout; +};
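To tie the pieces of this header together, here is a condensed userspace
sketch of the intended flow: create a VM in VM_BIND mode, bind a BO at a user
chosen GPU virtual address, and submit with the batch address extension. This
is only a sketch against the RFC definitions above (which are not yet real
uapi); the context/VM association via I915_CONTEXT_PARAM_VM, BO creation,
waiting for the asynchronous bind to complete, and error handling are all
elided, and the handles/addresses are placeholders.

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>
/* Plus the definitions proposed above (Documentation/gpu/rfc/i915_vm_bind.h). */

/* Illustrative only; handles, addresses and sizes are placeholders. */
static int vm_bind_submit_example(int drm_fd, uint32_t ctx_id,
                                  uint32_t bo_handle, uint64_t bo_size,
                                  uint64_t batch_va)
{
        struct drm_i915_gem_vm_control vm_create = {
                .flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,
        };
        struct drm_i915_gem_vm_bind bind = {};
        struct drm_i915_gem_execbuffer_ext_batch_addresses *ext;
        struct drm_i915_gem_execbuffer2 eb = {};
        int ret;

        /* Opt the new address space in to VM_BIND mode at creation time. */
        if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm_create))
                return -1;
        /* Associating vm_create.vm_id with ctx_id via I915_CONTEXT_PARAM_VM
         * is elided here. */

        /* Bind the whole BO at a user managed GPU virtual address. The bind
         * is asynchronous; syncing via the in/out fence extensions is elided. */
        bind.vm_id = vm_create.vm_id;
        bind.handle = bo_handle;
        bind.start = batch_va;
        bind.offset = 0;
        bind.length = bo_size;
        if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind))
                return -1;

        /* No execlist in VM_BIND mode; the batch address is passed via the
         * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension. */
        ext = calloc(1, sizeof(*ext) + sizeof(uint64_t));
        if (!ext)
                return -1;
        ext->base.name = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES;
        ext->count = 1;
        ext->addr[0] = batch_va;

        /* buffers_ptr, buffer_count, batch_start_offset and batch_len stay 0. */
        eb.flags = I915_EXEC_USE_EXTENSIONS;
        eb.cliprects_ptr = (uintptr_t)ext;
        eb.rsvd1 = ctx_id;

        ret = ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &eb);
        free(ext);
        return ret;
}

Compared to the legacy path, note that no execlist or relocation information
is passed at submission time; residency is entirely determined by the earlier
vm_bind call.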
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 {
        __u64 buffers_ptr;              -> must be 0 (new)
        __u32 buffer_count;             -> must be 0 (new)
        __u32 batch_start_offset;       -> must be 0 (new)
        __u32 batch_len;                -> must be 0 (new)
        __u32 DR1;                      -> must be 0 (old)
        __u32 DR4;                      -> must be 0 (old)
        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
        __u64 flags;                    -> some flags must be 0 (new)
        __u64 rsvd1; (context info)     -> repurposed field (old)
        __u64 rsvd2;                    -> unused
};
Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how the VM was created (which is an entirely different ioctl).
From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0) + +/** + * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING + * + * Flag to declare context as long running. + * See struct drm_i915_gem_context_create_ext flags. + * + * Usage of dma-fence expects that they complete in reasonable amount of time. + * Compute on the other hand can be long running. Hence it is not appropriate + * for compute contexts to export request completion dma-fence to user. + * The dma-fence usage will be limited to in-kernel consumption only. + * Compute contexts need to use user/memory fence. + * + * So, long running contexts do not support output fences. Hence, + * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and + * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected + * to be not used. + * + * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped + * to long running contexts. + */ +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2) + +/* VM_BIND related ioctls */ +#define DRM_I915_GEM_VM_BIND 0x3d +#define DRM_I915_GEM_VM_UNBIND 0x3e +#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f + +#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence) + +/** + * struct drm_i915_gem_vm_bind - VA to object mapping to bind. + * + * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU + * virtual address (VA) range to the section of an object that should be bound + * in the device page table of the specified address space (VM). + * The VA range specified must be unique (ie., not currently bound) and can + * be mapped to whole object or a section of the object (partial binding). + * Multiple VA mappings can be created to the same section of the object + * (aliasing). + */ +struct drm_i915_gem_vm_bind { + /** @vm_id: VM (address space) id to bind */ + __u32 vm_id; + + /** @handle: Object handle */ + __u32 handle; + + /** @start: Virtual Address start to bind */ + __u64 start; + + /** @offset: Offset in object to bind */ + __u64 offset; + + /** @length: Length of mapping to bind */ + __u64 length; + + /** + * @flags: Supported flags are, + * + * I915_GEM_VM_BIND_READONLY: + * Mapping is read-only. + * + * I915_GEM_VM_BIND_CAPTURE: + * Capture this mapping in the dump upon GPU error. + */ + __u64 flags; +#define I915_GEM_VM_BIND_READONLY (1 << 0) +#define I915_GEM_VM_BIND_CAPTURE (1 << 1) + + /** @extensions: 0-terminated chain of extensions for this mapping. */ + __u64 extensions; +}; + +/** + * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind. + * + * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual + * address (VA) range that should be unbound from the device page table of the + * specified address space (VM). The specified VA range must match one of the + * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind + * completion. + */ +struct drm_i915_gem_vm_unbind { + /** @vm_id: VM (address space) id to bind */ + __u32 vm_id; + + /** @rsvd: Reserved for future use; must be zero. 
*/ + __u32 rsvd; + + /** @start: Virtual Address start to unbind */ + __u64 start; + + /** @length: Length of mapping to unbind */ + __u64 length; + + /** @flags: reserved for future usage, currently MBZ */ + __u64 flags; + + /** @extensions: 0-terminated chain of extensions for this mapping. */ + __u64 extensions; +}; + +/** + * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind + * or the vm_unbind work. + * + * The vm_bind or vm_unbind aync worker will wait for input fence to signal + * before starting the binding or unbinding. + * + * The vm_bind or vm_unbind async worker will signal the returned output fence + * after the completion of binding or unbinding. + */ +struct drm_i915_vm_bind_fence { + /** @handle: User's handle for a drm_syncobj to wait on or signal. */ + __u32 handle; + + /** + * @flags: Supported flags are, + * + * I915_VM_BIND_FENCE_WAIT: + * Wait for the input fence before binding/unbinding + * + * I915_VM_BIND_FENCE_SIGNAL: + * Return bind/unbind completion fence as output + */ + __u32 flags; +#define I915_VM_BIND_FENCE_WAIT (1<<0) +#define I915_VM_BIND_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1)) +}; + +/** + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind + * and vm_unbind. + * + * This structure describes an array of timeline drm_syncobj and associated + * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's + * can be input or output fences (See struct drm_i915_vm_bind_fence). + */ +struct drm_i915_vm_bind_ext_timeline_fences { +#define I915_VM_BIND_EXT_timeline_FENCES 0 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** + * @fence_count: Number of elements in the @handles_ptr & @value_ptr + * arrays. + */ + __u64 fence_count; + + /** + * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence + * of length @fence_count. + */ + __u64 handles_ptr; + + /** + * @values_ptr: Pointer to an array of u64 values of length + * @fence_count. + * Values must be 0 for a binary drm_syncobj. A Value of 0 for a + * timeline drm_syncobj is invalid as it turns a drm_syncobj into a + * binary one. + */ + __u64 values_ptr; +}; + +/** + * struct drm_i915_vm_bind_user_fence - An input or output user fence for the + * vm_bind or the vm_unbind work. + * + * The vm_bind or vm_unbind aync worker will wait for the input fence (value at + * @addr to become equal to @val) before starting the binding or unbinding. + * + * The vm_bind or vm_unbind async worker will signal the output fence after + * the completion of binding or unbinding by writing @val to memory location at + * @addr + */ +struct drm_i915_vm_bind_user_fence { + /** @addr: User/Memory fence qword aligned process virtual address */ + __u64 addr; + + /** @val: User/Memory fence value to be written after bind completion */ + __u64 val; + + /** + * @flags: Supported flags are, + * + * I915_VM_BIND_USER_FENCE_WAIT: + * Wait for the input fence before binding/unbinding + * + * I915_VM_BIND_USER_FENCE_SIGNAL: + * Return bind/unbind completion fence as output + */ + __u32 flags; +#define I915_VM_BIND_USER_FENCE_WAIT (1<<0) +#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \ + (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1)) +}; + +/** + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind + * and vm_unbind. 
+ * + * These user fences can be input or output fences + * (See struct drm_i915_vm_bind_user_fence). + */ +struct drm_i915_vm_bind_ext_user_fence { +#define I915_VM_BIND_EXT_USER_FENCES 1 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** @fence_count: Number of elements in the @user_fence_ptr array. */ + __u64 fence_count; + + /** + * @user_fence_ptr: Pointer to an array of + * struct drm_i915_vm_bind_user_fence of length @fence_count. + */ + __u64 user_fence_ptr; +}; + +/** + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer + * gpu virtual addresses. + * + * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension + * must always be appended in the VM_BIND mode and it will be an error to + * append this extension in older non-VM_BIND mode. + */ +struct drm_i915_gem_execbuffer_ext_batch_addresses { +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** @count: Number of addresses in the addr array. */ + __u32 count; + + /** @addr: An array of batch gpu virtual addresses. */ + __u64 addr[0]; +}; + +/** + * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion + * signaling extension. + * + * This extension allows user to attach a user fence (@addr, @value pair) to an + * execbuf to be signaled by the command streamer after the completion of first + * level batch, by writing the @value at specified @addr and triggering an + * interrupt. + * User can either poll for this user fence to signal or can also wait on it + * with i915_gem_wait_user_fence ioctl. + * This is very much usefaul for long running contexts where waiting on dma-fence + * by user (like i915_gem_wait ioctl) is not supported. + */ +struct drm_i915_gem_execbuffer_ext_user_fence { +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** + * @addr: User/Memory fence qword aligned GPU virtual address. + * + * Address has to be a valid GPU virtual address at the time of + * first level batch completion. + */ + __u64 addr; + + /** + * @value: User/Memory fence Value to be written to above address + * after first level batch completes. + */ + __u64 value; + + /** @rsvd: Reserved for future extensions, MBZ */ + __u64 rsvd; +}; + +/** + * struct drm_i915_gem_create_ext_vm_private - Extension to make the object + * private to the specified VM. + * + * See struct drm_i915_gem_create_ext. + */ +struct drm_i915_gem_create_ext_vm_private { +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** @vm_id: Id of the VM to which the object is private */ + __u32 vm_id; +}; + +/** + * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence. + * + * User/Memory fence can be woken up either by: + * + * 1. GPU context indicated by @ctx_id, or, + * 2. Kerrnel driver async worker upon I915_UFENCE_WAIT_SOFT. + * @ctx_id is ignored when this flag is set. + * + * Wakeup condition is, + * ``((*addr & mask) op (value & mask))`` + * + * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>` + */ +struct drm_i915_gem_wait_user_fence { + /** @extensions: Zero-terminated chain of extensions. 
*/ + __u64 extensions; + + /** @addr: User/Memory fence address */ + __u64 addr; + + /** @ctx_id: Id of the Context which will signal the fence. */ + __u32 ctx_id; + + /** @op: Wakeup condition operator */ + __u16 op; +#define I915_UFENCE_WAIT_EQ 0 +#define I915_UFENCE_WAIT_NEQ 1 +#define I915_UFENCE_WAIT_GT 2 +#define I915_UFENCE_WAIT_GTE 3 +#define I915_UFENCE_WAIT_LT 4 +#define I915_UFENCE_WAIT_LTE 5 +#define I915_UFENCE_WAIT_BEFORE 6 +#define I915_UFENCE_WAIT_AFTER 7 + + /** + * @flags: Supported flags are, + * + * I915_UFENCE_WAIT_SOFT: + * + * To be woken up by i915 driver async worker (not by GPU). + * + * I915_UFENCE_WAIT_ABSTIME: + * + * Wait timeout specified as absolute time. + */ + __u16 flags; +#define I915_UFENCE_WAIT_SOFT 0x1 +#define I915_UFENCE_WAIT_ABSTIME 0x2 + + /** @value: Wakeup value */ + __u64 value; + + /** @mask: Wakeup mask */ + __u64 mask; +#define I915_UFENCE_WAIT_U8 0xffu +#define I915_UFENCE_WAIT_U16 0xffffu +#define I915_UFENCE_WAIT_U32 0xfffffffful +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull + + /** + * @timeout: Wait timeout in nanoseconds. + * + * If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is the + * absolute time in nsec. + */ + __s64 timeout; +};
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 {
        __u64 buffers_ptr;          -> must be 0 (new)
        __u32 buffer_count;         -> must be 0 (new)
        __u32 batch_start_offset;   -> must be 0 (new)
        __u32 batch_len;            -> must be 0 (new)
        __u32 DR1;                  -> must be 0 (old)
        __u32 DR4;                  -> must be 0 (old)
        __u32 num_cliprects; (fences)              -> must be 0 since using extensions
        __u64 cliprects_ptr; (fences, extensions)  -> contains an actual pointer!
        __u64 flags;                -> some flags must be 0 (new)
        __u64 rsvd1; (context info) -> repurposed field (old)
        __u64 rsvd2;                -> unused
};
Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how the VM was created (which is an entirely different ioctl).
From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.
The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Niranjana
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0)
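For reference, opting a VM into this mode would be a plain GEM_VM_CREATE call with the flag set. A minimal sketch, using the existing drm_i915_gem_vm_control uapi plus the flag value proposed here (drm_fd and vm_id are placeholders for the opened DRM device fd and wherever the caller stores the new VM id):

        struct drm_i915_gem_vm_control vm_create = {
                .flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,
        };

        if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm_create) == 0)
                vm_id = vm_create.vm_id;  /* address space now in VM_BIND mode */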
+/**
- DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
- Flag to declare context as long running.
- See struct drm_i915_gem_context_create_ext flags.
- Usage of dma-fence expects that they complete in reasonable amount of time.
- Compute on the other hand can be long running. Hence it is not appropriate
- for compute contexts to export request completion dma-fence to user.
- The dma-fence usage will be limited to in-kernel consumption only.
- Compute contexts need to use user/memory fence.
- So, long running contexts do not support output fences. Hence,
- I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
- I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
- to be not used.
- DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
- to long running contexts.
- */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
+/* VM_BIND related ioctls */ +#define DRM_I915_GEM_VM_BIND 0x3d +#define DRM_I915_GEM_VM_UNBIND 0x3e +#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f
+#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+/**
- struct drm_i915_gem_vm_bind - VA to object mapping to bind.
- This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
- virtual address (VA) range to the section of an object that should be bound
- in the device page table of the specified address space (VM).
- The VA range specified must be unique (ie., not currently bound) and can
- be mapped to whole object or a section of the object (partial binding).
- Multiple VA mappings can be created to the same section of the object
- (aliasing).
- */
+struct drm_i915_gem_vm_bind {
/** @vm_id: VM (address space) id to bind */
__u32 vm_id;
/** @handle: Object handle */
__u32 handle;
/** @start: Virtual Address start to bind */
__u64 start;
/** @offset: Offset in object to bind */
__u64 offset;
/** @length: Length of mapping to bind */
__u64 length;
/**
* @flags: Supported flags are,
*
* I915_GEM_VM_BIND_READONLY:
* Mapping is read-only.
*
* I915_GEM_VM_BIND_CAPTURE:
* Capture this mapping in the dump upon GPU error.
*/
__u64 flags;
+#define I915_GEM_VM_BIND_READONLY (1 << 0) +#define I915_GEM_VM_BIND_CAPTURE (1 << 1)
/** @extensions: 0-terminated chain of extensions for this mapping. */
__u64 extensions;
+};
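To make the intended flow concrete, below is a minimal userspace sketch of a bind call against the uapi proposed above. It is illustrative only; it assumes <stdint.h>, <string.h>, <sys/ioctl.h> and this RFC header on top of the usual drm/i915_drm.h, and none of it is in the released i915 uapi yet:

        static int bind_bo(int drm_fd, uint32_t vm_id, uint32_t bo_handle,
                           uint64_t gpu_va, uint64_t bo_offset, uint64_t size)
        {
                struct drm_i915_gem_vm_bind bind;

                memset(&bind, 0, sizeof(bind));
                bind.vm_id = vm_id;       /* VM created with I915_VM_CREATE_FLAGS_USE_VM_BIND */
                bind.handle = bo_handle;  /* GEM object to map */
                bind.start = gpu_va;      /* user managed GPU virtual address */
                bind.offset = bo_offset;  /* non-zero offset gives a partial binding */
                bind.length = size;
                bind.flags = 0;           /* e.g. I915_GEM_VM_BIND_READONLY */
                bind.extensions = 0;      /* no fences chained in this sketch */

                return ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);
        }

With no fence extension chained, the bind is still queued to the async worker; a caller that needs to know when the mapping is ready would chain one of the fence extensions shown further below.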
+/**
- struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
- This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
- address (VA) range that should be unbound from the device page table of the
- specified address space (VM). The specified VA range must match one of the
- mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
- completion.
- */
+struct drm_i915_gem_vm_unbind {
/** @vm_id: VM (address space) id to bind */
__u32 vm_id;
/** @rsvd: Reserved for future use; must be zero. */
__u32 rsvd;
/** @start: Virtual Address start to unbind */
__u64 start;
/** @length: Length of mapping to unbind */
__u64 length;
/** @flags: reserved for future usage, currently MBZ */
__u64 flags;
/** @extensions: 0-terminated chain of extensions for this mapping. */
__u64 extensions;
+};
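The matching unbind, continuing the same sketch (the VA range must exactly match an earlier bind):

        static int unbind_va(int drm_fd, uint32_t vm_id, uint64_t gpu_va, uint64_t size)
        {
                struct drm_i915_gem_vm_unbind unbind;

                memset(&unbind, 0, sizeof(unbind));
                unbind.vm_id = vm_id;
                unbind.start = gpu_va;    /* must match the range passed to VM_BIND */
                unbind.length = size;

                return ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_UNBIND, &unbind);
        }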
+/**
- struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
- or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence to signal
- before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the returned output fence
- after the completion of binding or unbinding.
- */
+struct drm_i915_vm_bind_fence {
/** @handle: User's handle for a drm_syncobj to wait on or signal. */
__u32 handle;
/**
* @flags: Supported flags are,
*
* I915_VM_BIND_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
__u32 flags;
+#define I915_VM_BIND_FENCE_WAIT (1<<0) +#define I915_VM_BIND_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1)) +};
+/**
- struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
- and vm_unbind.
- This structure describes an array of timeline drm_syncobj and associated
- points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
- can be input or output fences (See struct drm_i915_vm_bind_fence).
- */
+struct drm_i915_vm_bind_ext_timeline_fences { +#define I915_VM_BIND_EXT_timeline_FENCES 0
/** @base: Extension link. See struct i915_user_extension. */
struct i915_user_extension base;
/**
* @fence_count: Number of elements in the @handles_ptr & @value_ptr
* arrays.
*/
__u64 fence_count;
/**
* @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
* of length @fence_count.
*/
__u64 handles_ptr;
/**
* @values_ptr: Pointer to an array of u64 values of length
* @fence_count.
* Values must be 0 for a binary drm_syncobj. A Value of 0 for a
* timeline drm_syncobj is invalid as it turns a drm_syncobj into a
* binary one.
*/
__u64 values_ptr;
+};
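As an illustration of how the extension chaining is expected to work, a hedged sketch of a bind that waits on one timeline point and signals another (the caller fills in the syncobj handles; same includes as the earlier sketches):

        static void chain_bind_timeline_fences(struct drm_i915_gem_vm_bind *bind,
                                               struct drm_i915_vm_bind_ext_timeline_fences *ext,
                                               struct drm_i915_vm_bind_fence fences[2],
                                               uint64_t points[2])
        {
                /* fences[0]/points[0]: wait for this point before binding,
                 * fences[1]/points[1]: signal this point on bind completion. */
                fences[0].flags = I915_VM_BIND_FENCE_WAIT;
                fences[1].flags = I915_VM_BIND_FENCE_SIGNAL;

                ext->base.name = I915_VM_BIND_EXT_timeline_FENCES;
                ext->fence_count = 2;
                ext->handles_ptr = (uintptr_t)fences;  /* .handle fields set by the caller */
                ext->values_ptr = (uintptr_t)points;   /* 0 only for binary syncobjs */

                bind->extensions = (uintptr_t)ext;
        }

All of ext, fences and points have to stay alive until the VM_BIND ioctl has actually been called on bind.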
+/**
- struct drm_i915_vm_bind_user_fence - An input or output user fence for the
- vm_bind or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence (value at
- @addr to become equal to @val) before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the output fence after
- the completion of binding or unbinding by writing @val to the memory location
- at @addr.
- */
+struct drm_i915_vm_bind_user_fence {
/** @addr: User/Memory fence qword aligned process virtual address */
__u64 addr;
/** @val: User/Memory fence value to be written after bind completion */
__u64 val;
/**
* @flags: Supported flags are,
*
* I915_VM_BIND_USER_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_USER_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
__u32 flags;
+#define I915_VM_BIND_USER_FENCE_WAIT (1<<0) +#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
(-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
+};
+/**
- struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
- and vm_unbind.
- These user fences can be input or output fences
- (See struct drm_i915_vm_bind_user_fence).
- */
+struct drm_i915_vm_bind_ext_user_fence { +#define I915_VM_BIND_EXT_USER_FENCES 1
/** @base: Extension link. See struct i915_user_extension. */
struct i915_user_extension base;
/** @fence_count: Number of elements in the @user_fence_ptr array. */
__u64 fence_count;
/**
* @user_fence_ptr: Pointer to an array of
* struct drm_i915_vm_bind_user_fence of length @fence_count.
*/
__u64 user_fence_ptr;
+};
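For compute (long running) contexts, the same chaining applies but with user/memory fences instead of syncobjs. A sketch under the same assumptions, signalling *fence_addr = done_value once the bind has completed:

        static void chain_bind_user_fence(struct drm_i915_gem_vm_bind *bind,
                                          struct drm_i915_vm_bind_ext_user_fence *ext,
                                          struct drm_i915_vm_bind_user_fence *fence,
                                          uint64_t *fence_addr, uint64_t done_value)
        {
                fence->addr = (uintptr_t)fence_addr;  /* qword aligned process VA */
                fence->val = done_value;
                fence->flags = I915_VM_BIND_USER_FENCE_SIGNAL;

                ext->base.name = I915_VM_BIND_EXT_USER_FENCES;
                ext->fence_count = 1;
                ext->user_fence_ptr = (uintptr_t)fence;

                bind->extensions = (uintptr_t)ext;
        }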
+/**
- struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
- gpu virtual addresses.
- In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
- must always be appended in the VM_BIND mode and it will be an error to
- append this extension in older non-VM_BIND mode.
- */
+struct drm_i915_gem_execbuffer_ext_batch_addresses { +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1
/** @base: Extension link. See struct i915_user_extension. */
struct i915_user_extension base;
/** @count: Number of addresses in the addr array. */
__u32 count;
/** @addr: An array of batch gpu virtual addresses. */
__u64 addr[0];
+};
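Putting the pieces together, a VM_BIND mode submission through the existing execbuf2 ioctl would look roughly like the sketch below. It follows the rules in the I915_VM_CREATE_FLAGS_USE_VM_BIND documentation above (extension chain passed via cliprects_ptr with I915_EXEC_USE_EXTENSIONS, everything legacy left at zero); note the discussion further down about whether a dedicated execbuf3 ioctl should replace this:

        static int submit_vm_bind(int drm_fd, uint32_t ctx_id, uint64_t batch_gpu_va)
        {
                struct drm_i915_gem_execbuffer2 execbuf;
                struct drm_i915_gem_execbuffer_ext_batch_addresses *ext;
                int ret;

                ext = calloc(1, sizeof(*ext) + sizeof(uint64_t)); /* room for one address */
                if (!ext)
                        return -1;
                ext->base.name = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES;
                ext->count = 1;
                ext->addr[0] = batch_gpu_va;

                memset(&execbuf, 0, sizeof(execbuf));
                execbuf.rsvd1 = ctx_id;                  /* context id, as with execbuf2 today */
                execbuf.flags = I915_EXEC_USE_EXTENSIONS;
                execbuf.cliprects_ptr = (uintptr_t)ext;  /* extension chain */
                /* buffers_ptr, buffer_count, batch_start_offset, batch_len stay 0 */

                ret = ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
                free(ext);
                return ret;
        }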
+/**
- struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
- signaling extension.
- This extension allows user to attach a user fence (@addr, @value pair) to an
- execbuf to be signaled by the command streamer after the completion of first
- level batch, by writing the @value at specified @addr and triggering an
- interrupt.
- User can either poll for this user fence to signal or can also wait on it
- with i915_gem_wait_user_fence ioctl.
- This is very useful for long running contexts where waiting on dma-fence
- by user (like i915_gem_wait ioctl) is not supported.
- */
+struct drm_i915_gem_execbuffer_ext_user_fence { +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2
/** @base: Extension link. See struct i915_user_extension. */
struct i915_user_extension base;
/**
* @addr: User/Memory fence qword aligned GPU virtual address.
*
* Address has to be a valid GPU virtual address at the time of
* first level batch completion.
*/
__u64 addr;
/**
* @value: User/Memory fence value to be written to the above address
* after first level batch completes.
*/
__u64 value;
/** @rsvd: Reserved for future extensions, MBZ */
__u64 rsvd;
+};
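A sketch of attaching that completion fence to a submission by chaining it behind another extension (for instance the batch-addresses extension from the earlier sketch); fence_gpu_va and seqno are placeholders:

        static void attach_batch_done_fence(struct i915_user_extension *chain_tail,
                                            struct drm_i915_gem_execbuffer_ext_user_fence *done,
                                            uint64_t fence_gpu_va, uint64_t seqno)
        {
                /* The CS writes seqno to fence_gpu_va and raises an interrupt once the
                 * first level batch completes; that VA must still be bound at that point. */
                done->base.name = DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE;
                done->addr = fence_gpu_va;
                done->value = seqno;

                chain_tail->next_extension = (uintptr_t)done;
        }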
+/**
- struct drm_i915_gem_create_ext_vm_private - Extension to make the object
- private to the specified VM.
- See struct drm_i915_gem_create_ext.
- */
+struct drm_i915_gem_create_ext_vm_private { +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2
/** @base: Extension link. See struct i915_user_extension. */
struct i915_user_extension base;
/** @vm_id: Id of the VM to which the object is private */
__u32 vm_id;
+};
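A sketch of creating such a VM-private object, combining the upstream gem_create_ext uapi with the extension proposed here:

        static int create_vm_private_bo(int drm_fd, uint32_t vm_id, uint64_t size,
                                        uint32_t *handle_out)
        {
                struct drm_i915_gem_create_ext_vm_private vm_priv = {
                        .base.name = I915_GEM_CREATE_EXT_VM_PRIVATE,
                        .vm_id = vm_id,
                };
                struct drm_i915_gem_create_ext create = {
                        .size = size,
                        .extensions = (uintptr_t)&vm_priv,
                };
                int ret = ioctl(drm_fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);

                if (ret == 0)
                        *handle_out = create.handle;
                return ret;
        }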
+/**
- struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
- User/Memory fence can be woken up either by:
- GPU context indicated by @ctx_id, or,
- Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
- @ctx_id is ignored when this flag is set.
- Wakeup condition is,
- ``((*addr & mask) op (value & mask))``
- See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
- */
+struct drm_i915_gem_wait_user_fence {
/** @extensions: Zero-terminated chain of extensions. */
__u64 extensions;
/** @addr: User/Memory fence address */
__u64 addr;
/** @ctx_id: Id of the Context which will signal the fence. */
__u32 ctx_id;
/** @op: Wakeup condition operator */
__u16 op;
+#define I915_UFENCE_WAIT_EQ 0 +#define I915_UFENCE_WAIT_NEQ 1 +#define I915_UFENCE_WAIT_GT 2 +#define I915_UFENCE_WAIT_GTE 3 +#define I915_UFENCE_WAIT_LT 4 +#define I915_UFENCE_WAIT_LTE 5 +#define I915_UFENCE_WAIT_BEFORE 6 +#define I915_UFENCE_WAIT_AFTER 7
/**
* @flags: Supported flags are,
*
* I915_UFENCE_WAIT_SOFT:
*
* To be woken up by i915 driver async worker (not by GPU).
*
* I915_UFENCE_WAIT_ABSTIME:
*
* Wait timeout specified as absolute time.
*/
__u16 flags;
+#define I915_UFENCE_WAIT_SOFT 0x1 +#define I915_UFENCE_WAIT_ABSTIME 0x2
/** @value: Wakeup value */
__u64 value;
/** @mask: Wakeup mask */
__u64 mask;
+#define I915_UFENCE_WAIT_U8 0xffu +#define I915_UFENCE_WAIT_U16 0xffffu +#define I915_UFENCE_WAIT_U32 0xfffffffful +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull
/**
* @timeout: Wait timeout in nanoseconds.
*
* If I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is the
* absolute time in nsec.
*/
__s64 timeout;
+};
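For completeness, a hedged sketch of a blocking wait on such a user fence, assuming this ioctl lands as proposed; the address/value pair is the same one attached through the signal extensions above:

        static int wait_user_fence(int drm_fd, uint32_t ctx_id,
                                   uint64_t *fence_addr, uint64_t wait_value,
                                   int64_t timeout_ns)
        {
                struct drm_i915_gem_wait_user_fence wait = {
                        .addr = (uintptr_t)fence_addr,
                        .ctx_id = ctx_id,            /* ignored if I915_UFENCE_WAIT_SOFT is set */
                        .op = I915_UFENCE_WAIT_GTE,  /* wake when (*addr & mask) >= (value & mask) */
                        .value = wait_value,
                        .mask = I915_UFENCE_WAIT_U64,
                        .timeout = timeout_ns,       /* relative, as ABSTIME is not set */
                };

                return ioctl(drm_fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);
        }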
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 { __u64 buffers_ptr; -> must be 0 (new) __u32 buffer_count; -> must be 0 (new) __u32 batch_start_offset; -> must be 0 (new) __u32 batch_len; -> must be 0 (new) __u32 DR1; -> must be 0 (old) __u32 DR4; -> must be 0 (old) __u32 num_cliprects; (fences) -> must be 0 since using extensions __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer! __u64 flags; -> some flags must be 0 (new) __u64 rsvd1; (context info) -> repurposed field (old) __u64 rsvd2; -> unused };
Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how the VM was created (which is an entirely different ioctl).
From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.
The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
Dave.
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 { __u64 buffers_ptr; -> must be 0 (new) __u32 buffer_count; -> must be 0 (new) __u32 batch_start_offset; -> must be 0 (new) __u32 batch_len; -> must be 0 (new) __u32 DR1; -> must be 0 (old) __u32 DR4; -> must be 0 (old) __u32 num_cliprects; (fences) -> must be 0 since using extensions __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer! __u64 flags; -> some flags must be 0 (new) __u64 rsvd1; (context info) -> repurposed field (old) __u64 rsvd2; -> unused };
Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how the VM was created (which is an entirely different ioctl).
From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.
The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this? -Daniel
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 { __u64 buffers_ptr; -> must be 0 (new) __u32 buffer_count; -> must be 0 (new) __u32 batch_start_offset; -> must be 0 (new) __u32 batch_len; -> must be 0 (new) __u32 DR1; -> must be 0 (old) __u32 DR4; -> must be 0 (old) __u32 num_cliprects; (fences) -> must be 0 since using extensions __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer! __u64 flags; -> some flags must be 0 (new) __u64 rsvd1; (context info) -> repurposed field (old) __u64 rsvd2; -> unused };
Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how the VM was created (which is an entirely different ioctl).
From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.
The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 as well (but the bit positions will differ). But I guess these should be fine, as the suggestion here is to copy-paste the execbuf code and have shared code where possible. Besides, we can stop supporting some older features in execbuf3 (like the fence array, in favor of the newer timeline fences), which will further reduce the common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Niranjana
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 { __u64 buffers_ptr; -> must be 0 (new) __u32 buffer_count; -> must be 0 (new) __u32 batch_start_offset; -> must be 0 (new) __u32 batch_len; -> must be 0 (new) __u32 DR1; -> must be 0 (old) __u32 DR4; -> must be 0 (old) __u32 num_cliprects; (fences) -> must be 0 since using extensions __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer! __u64 flags; -> some flags must be 0 (new) __u64 rsvd1; (context info) -> repurposed field (old) __u64 rsvd2; -> unused };
Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how the VM was created (which is an entirely different ioctl).
From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.
The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;           /* previously execbuffer2.rsvd1 */

        __u32 batch_count;
        __u64 batch_addr_ptr;   /* Pointer to an array of batch gpu virtual addresses */

        __u64 flags;
#define I915_EXEC3_RING_MASK            (0x3f)
#define I915_EXEC3_DEFAULT              (0<<0)
#define I915_EXEC3_RENDER               (1<<0)
#define I915_EXEC3_BSD                  (2<<0)
#define I915_EXEC3_BLT                  (3<<0)
#define I915_EXEC3_VEBOX                (4<<0)

#define I915_EXEC3_SECURE               (1<<6)
#define I915_EXEC3_IS_PINNED            (1<<7)

#define I915_EXEC3_BSD_SHIFT            (8)
#define I915_EXEC3_BSD_MASK             (3 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_DEFAULT          (0 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING1            (1 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING2            (2 << I915_EXEC3_BSD_SHIFT)

#define I915_EXEC3_FENCE_IN             (1<<10)
#define I915_EXEC3_FENCE_OUT            (1<<11)
#define I915_EXEC3_FENCE_SUBMIT         (1<<12)

        __u64 in_out_fence;     /* previously execbuffer2.rsvd2 */

        __u64 extensions;       /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
};
With this, user can pass in batch addresses and count directly, instead of as an extension (as this rfc series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Anything else that needs to be added or removed?
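Purely to help picture the proposal (nothing here exists yet; the ioctl number, the remaining flags and the fence fields are exactly what is being debated here), filling the proposed struct for a single batch might look like:

        struct drm_i915_gem_execbuffer3 eb3 = {
                .ctx_id = ctx_id,                        /* context with an engine map */
                .batch_count = 1,
                .batch_addr_ptr = (uintptr_t)&batch_va,  /* array of GPU VAs, here just one */
                .extensions = (uintptr_t)&timeline_fences_ext,
        };

        /* ret = ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER3, &eb3); -- hypothetical */

ctx_id, batch_va and timeline_fences_ext are placeholders, and DRM_IOCTL_I915_GEM_EXECBUFFER3 is purely hypothetical at this point.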
Niranjana
Niranjana
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On 03/06/2022 07:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote: > VM_BIND and related uapi definitions > > v2: Ensure proper kernel-doc formatting with cross references. > Also add new uapi and documentation as per review comments > from Daniel. > > Signed-off-by: Niranjana Vishwanathapura
niranjana.vishwanathapura@intel.com
> --- > Documentation/gpu/rfc/i915_vm_bind.h | 399
+++++++++++++++++++++++++++
> 1 file changed, 399 insertions(+) > create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h > > diff --git a/Documentation/gpu/rfc/i915_vm_bind.h
b/Documentation/gpu/rfc/i915_vm_bind.h
> new file mode 100644 > index 000000000000..589c0a009107 > --- /dev/null > +++ b/Documentation/gpu/rfc/i915_vm_bind.h > @@ -0,0 +1,399 @@ > +/* SPDX-License-Identifier: MIT */ > +/* > + * Copyright © 2022 Intel Corporation > + */ > + > +/** > + * DOC: I915_PARAM_HAS_VM_BIND > + * > + * VM_BIND feature availability. > + * See typedef drm_i915_getparam_t param. > + */ > +#define I915_PARAM_HAS_VM_BIND 57 > + > +/** > + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND > + * > + * Flag to opt-in for VM_BIND mode of binding during VM creation. > + * See struct drm_i915_gem_vm_control flags. > + * > + * A VM in VM_BIND mode will not support the older execbuff
mode of binding.
> + * In VM_BIND mode, execbuff ioctl will not accept any execlist
(ie., the
> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). > + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and > + * &drm_i915_gem_execbuffer2.batch_len must be 0. > + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must
be provided
> + * to pass in the batch buffer addresses. > + * > + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and > + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags
must be 0
> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag
must always be
> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses). > + * The buffers_ptr, buffer_count, batch_start_offset and
batch_len fields
> + * of struct drm_i915_gem_execbuffer2 are also not used and
must be 0.
> + */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 {
        __u64 buffers_ptr;          -> must be 0 (new)
        __u32 buffer_count;         -> must be 0 (new)
        __u32 batch_start_offset;   -> must be 0 (new)
        __u32 batch_len;            -> must be 0 (new)
        __u32 DR1;                  -> must be 0 (old)
        __u32 DR4;                  -> must be 0 (old)
        __u32 num_cliprects; (fences)              -> must be 0 since using extensions
        __u64 cliprects_ptr; (fences, extensions)  -> contains an actual pointer!
        __u64 flags;                -> some flags must be 0 (new)
        __u64 rsvd1; (context info) -> repurposed field (old)
        __u64 rsvd2;                -> unused
};
Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how
the VM
was created (which is an entirely different ioctl).
From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.
The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
struct drm_i915_gem_execbuffer3 { __u32 ctx_id; /* previously execbuffer2.rsvd1 */
__u32 batch_count; __u64 batch_addr_ptr; /* Pointer to an array of batch gpu virtual addresses */
Casual stumble upon..
Alternatively you could embed N pointers to make life a bit easier for both userspace and kernel side. Yes, but then "N batch buffers should be enough for everyone" problem.. :)
__u64 flags; #define I915_EXEC3_RING_MASK (0x3f) #define I915_EXEC3_DEFAULT (0<<0) #define I915_EXEC3_RENDER (1<<0) #define I915_EXEC3_BSD (2<<0) #define I915_EXEC3_BLT (3<<0) #define I915_EXEC3_VEBOX (4<<0)
#define I915_EXEC3_SECURE (1<<6) #define I915_EXEC3_IS_PINNED (1<<7)
#define I915_EXEC3_BSD_SHIFT (8) #define I915_EXEC3_BSD_MASK (3 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_DEFAULT (0 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING1 (1 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING2 (2 << I915_EXEC3_BSD_SHIFT)
I'd suggest legacy engine selection is unwanted, especially not with the convoluted BSD1/2 flags. Can we just require context with engine map and index? Or if default context has to be supported then I'd suggest ...class_instance for that mode.
#define I915_EXEC3_FENCE_IN (1<<10) #define I915_EXEC3_FENCE_OUT (1<<11) #define I915_EXEC3_FENCE_SUBMIT (1<<12)
People are likely to object to the submit fence since the generic mechanism to align submissions was rejected.
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
New ioctl you can afford dedicated fields.
In any case I suggest you involve UMD folks in designing it.
Regards,
Tvrtko
__u64 extensions; /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */ };
With this, user can pass in batch addresses and count directly, instead of as an extension (as this rfc series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Anything else that needs to be added or removed?
Niranjana
Niranjana
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Tue, Jun 07, 2022 at 11:42:08AM +0100, Tvrtko Ursulin wrote:
On 03/06/2022 07:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote: >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote: >> VM_BIND and related uapi definitions >> >> v2: Ensure proper kernel-doc formatting with cross references. >> Also add new uapi and documentation as per review comments >> from Daniel. >> >> Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com >> --- >> Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ >> 1 file changed, 399 insertions(+) >> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h >> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h >> new file mode 100644 >> index 000000000000..589c0a009107 >> --- /dev/null >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h >> @@ -0,0 +1,399 @@ >> +/* SPDX-License-Identifier: MIT */ >> +/* >> + * Copyright © 2022 Intel Corporation >> + */ >> + >> +/** >> + * DOC: I915_PARAM_HAS_VM_BIND >> + * >> + * VM_BIND feature availability. >> + * See typedef drm_i915_getparam_t param. >> + */ >> +#define I915_PARAM_HAS_VM_BIND 57 >> + >> +/** >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND >> + * >> + * Flag to opt-in for VM_BIND mode of binding during VM creation. >> + * See struct drm_i915_gem_vm_control flags. >> + * >> + * A VM in VM_BIND mode will not support the older execbuff mode of binding. >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and >> + * &drm_i915_gem_execbuffer2.batch_len must be 0. >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided >> + * to pass in the batch buffer addresses. >> + * >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0 >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses). >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields >> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0. >> + */ > >From that description, it seems we have: > >struct drm_i915_gem_execbuffer2 { > __u64 buffers_ptr; -> must be 0 (new) > __u32 buffer_count; -> must be 0 (new) > __u32 batch_start_offset; -> must be 0 (new) > __u32 batch_len; -> must be 0 (new) > __u32 DR1; -> must be 0 (old) > __u32 DR4; -> must be 0 (old) > __u32 num_cliprects; (fences) -> must be 0 since using extensions > __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer! > __u64 flags; -> some flags must be 0 (new) > __u64 rsvd1; (context info) -> repurposed field (old) > __u64 rsvd2; -> unused >}; > >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead >of adding even more complexity to an already abused interface? While >the Vulkan-like extension thing is really nice, I don't think what >we're doing here is extending the ioctl usage, we're completely >changing how the base struct should be interpreted based on how the VM >was created (which is an entirely different ioctl). > >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is >already at -6 without these changes. I think after vm_bind we'll need >to create a -11 entry just to deal with this ioctl. >
The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
struct drm_i915_gem_execbuffer3 { __u32 ctx_id; /* previously execbuffer2.rsvd1 */
__u32 batch_count; __u64 batch_addr_ptr; /* Pointer to an array of batch gpu virtual addresses */
Casual stumble upon..
Alternatively you could embed N pointers to make life a bit easier for both userspace and kernel side. Yes, but then "N batch buffers should be enough for everyone" problem.. :)
Thanks Tvrtko, Yes, hence the batch_addr_ptr.
__u64 flags; #define I915_EXEC3_RING_MASK (0x3f) #define I915_EXEC3_DEFAULT (0<<0) #define I915_EXEC3_RENDER (1<<0) #define I915_EXEC3_BSD (2<<0) #define I915_EXEC3_BLT (3<<0) #define I915_EXEC3_VEBOX (4<<0)
#define I915_EXEC3_SECURE (1<<6) #define I915_EXEC3_IS_PINNED (1<<7)
#define I915_EXEC3_BSD_SHIFT (8) #define I915_EXEC3_BSD_MASK (3 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_DEFAULT (0 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING1 (1 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING2 (2 << I915_EXEC3_BSD_SHIFT)
I'd suggest legacy engine selection is unwanted, especially not with the convoluted BSD1/2 flags. Can we just require context with engine map and index? Or if default context has to be supported then I'd suggest ...class_instance for that mode.
Ok, I will be happy to remove it and only support contexts with engine map, if UMDs agree on that.
#define I915_EXEC3_FENCE_IN (1<<10) #define I915_EXEC3_FENCE_OUT (1<<11) #define I915_EXEC3_FENCE_SUBMIT (1<<12)
People are likely to object to submit fence since generic mechanism to align submissions was rejected.
Ok, again, I can remove it if UMDs are ok with it.
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
New ioctl you can afford dedicated fields.
Yes, but as I asked below, I am not sure if we need this or the timeline fence arry extension we have is good enough.
In any case I suggest you involve UMD folks in designing it.
Yah. Paulo, Lionel, Jason, Daniel, can you comment on these regarding what will UMD need in execbuf3 and what can be removed?
Thanks, Niranjana
Regards,
Tvrtko
__u64 extensions; /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */ };
With this, user can pass in batch addresses and count directly, instead of as an extension (as this rfc series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Anything else that needs to be added or removed?
Niranjana
Niranjana
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On 07/06/2022 22:25, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:42:08AM +0100, Tvrtko Ursulin wrote:
On 03/06/2022 07:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:

VM_BIND and related uapi definitions

v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.

Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
---
 Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
 1 file changed, 399 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h

diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
new file mode 100644
index 000000000000..589c0a009107
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.h
@@ -0,0 +1,399 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+/**
+ * DOC: I915_PARAM_HAS_VM_BIND
+ *
+ * VM_BIND feature availability.
+ * See typedef drm_i915_getparam_t param.
+ */
+#define I915_PARAM_HAS_VM_BIND		57
+
+/**
+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
+ *
+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
+ * See struct drm_i915_gem_vm_control flags.
+ *
+ * A VM in VM_BIND mode will not support the older execbuff mode of binding.
+ * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
+ * &drm_i915_gem_execbuffer2.buffer_count must be 0).
+ * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
+ * &drm_i915_gem_execbuffer2.batch_len must be 0.
+ * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
+ * to pass in the batch buffer addresses.
+ *
+ * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
+ * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
+ * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
+ * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+ * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
+ * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
+ */

From that description, it seems we have:

struct drm_i915_gem_execbuffer2 {
        __u64 buffers_ptr;              -> must be 0 (new)
        __u32 buffer_count;             -> must be 0 (new)
        __u32 batch_start_offset;       -> must be 0 (new)
        __u32 batch_len;                -> must be 0 (new)
        __u32 DR1;                      -> must be 0 (old)
        __u32 DR4;                      -> must be 0 (old)
        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
        __u64 flags;                    -> some flags must be 0 (new)
        __u64 rsvd1; (context info)     -> repurposed field (old)
        __u64 rsvd2;                    -> unused
};

Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how the VM was created (which is an entirely different ioctl).

From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.

The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
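For reference, a minimal userspace sketch of the VM_BIND opt-in the quoted kernel-doc describes. I915_VM_CREATE_FLAGS_USE_VM_BIND and its bit value are assumptions from this RFC only; the rest is the existing drm_i915_gem_vm_control uapi:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    #ifndef I915_VM_CREATE_FLAGS_USE_VM_BIND
    #define I915_VM_CREATE_FLAGS_USE_VM_BIND (1u << 0)  /* RFC-only, assumed value */
    #endif

    /* Create an address space that opts in to VM_BIND mode of binding. */
    static int create_vm_bind_vm(int drm_fd, uint32_t *vm_id)
    {
            struct drm_i915_gem_vm_control ctl;

            memset(&ctl, 0, sizeof(ctl));
            ctl.flags = I915_VM_CREATE_FLAGS_USE_VM_BIND;

            if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_CREATE, &ctl))
                    return -1;

            *vm_id = ctl.vm_id;  /* attach to a context via I915_CONTEXT_PARAM_VM */
            return 0;
    }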
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softpin paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and have shared code where possible. Besides, we can stop supporting some older features in execbuf3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;           /* previously execbuffer2.rsvd1 */

        __u32 batch_count;
        __u64 batch_addr_ptr;   /* Pointer to an array of batch gpu virtual addresses */
Casual stumble upon..
Alternatively you could embed N pointers to make life a bit easier for both userspace and kernel side. Yes, but then "N batch buffers should be enough for everyone" problem.. :)
Thanks Tvrtko, Yes, hence the batch_addr_ptr.
Right, but then userspace has to allocate a separate buffer and kernel has to access it separately from a single copy_from_user. Pros and cons of "this many batches should be enough for everyone" versus the extra operations.
Hmm.. for the common case of one batch - you could define the uapi to say if batch_count is one then pointer is GPU VA to the batch itself, not a pointer to userspace array of GPU VA?
Regards,
Tvrtko
        __u64 flags;
#define I915_EXEC3_RING_MASK            (0x3f)
#define I915_EXEC3_DEFAULT              (0<<0)
#define I915_EXEC3_RENDER               (1<<0)
#define I915_EXEC3_BSD                  (2<<0)
#define I915_EXEC3_BLT                  (3<<0)
#define I915_EXEC3_VEBOX                (4<<0)

#define I915_EXEC3_SECURE               (1<<6)
#define I915_EXEC3_IS_PINNED            (1<<7)

#define I915_EXEC3_BSD_SHIFT            (8)
#define I915_EXEC3_BSD_MASK             (3 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_DEFAULT          (0 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING1            (1 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING2            (2 << I915_EXEC3_BSD_SHIFT)
I'd suggest legacy engine selection is unwanted, especially not with the convoluted BSD1/2 flags. Can we just require context with engine map and index? Or if default context has to be supported then I'd suggest ...class_instance for that mode.
Ok, I will be happy to remove it and only support contexts with engine map, if UMDs agree on that.
#define I915_EXEC3_FENCE_IN             (1<<10)
#define I915_EXEC3_FENCE_OUT            (1<<11)
#define I915_EXEC3_FENCE_SUBMIT         (1<<12)
People are likely to object to submit fence since generic mechanism to align submissions was rejected.
Ok, again, I can remove it if UMDs are ok with it.
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
New ioctl you can afford dedicated fields.
Yes, but as I asked below, I am not sure if we need this or the timeline fence array extension we have is good enough.
In any case I suggest you involve UMD folks in designing it.
Yah. Paulo, Lionel, Jason, Daniel, can you comment on these regarding what will UMD need in execbuf3 and what can be removed?
Thanks, Niranjana
Regards,
Tvrtko
        __u64 extensions;       /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
};
With this, user can pass in batch addresses and count directly, instead of as an extension (as this rfc series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Anything else that needs to be added or removed?
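For readers following along, a provisional consolidation of what execbuf3 could look like if the suggestions in this thread are taken (engine selected by index into the context engine map, batch VA passed inline for the single-batch case, fences only via the timeline fences extension). Every name and field below is a sketch of the discussion, not a settled uapi:

    /* Provisional sketch only; nothing here is final uapi. */
    struct drm_i915_gem_execbuffer3 {
            __u32 ctx_id;           /* context with an engine map, was execbuffer2.rsvd1 */
            __u32 engine_idx;       /* index into the context engine map (replaces ring flags) */

            __u32 batch_count;
            __u32 pad;
            __u64 batch_addr_ptr;   /* batch GPU VA if batch_count == 1, else pointer to a
                                     * userspace array of batch GPU VAs */

            __u64 flags;            /* legacy ring/BSD selection and FENCE_IN/OUT bits dropped */

            __u64 extensions;       /* e.g. DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
    };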
Niranjana
On Wed, Jun 08, 2022 at 08:34:36AM +0100, Tvrtko Ursulin wrote:
Hmm.. for the common case of one batch - you could define the uapi to say if batch_count is one then pointer is GPU VA to the batch itself, not a pointer to userspace array of GPU VA?
Yah, we can do that. ie., batch_addr_ptr is the batch VA when batch_count is 1. Otherwise, it is pointer to an array of batch VAs.
Other option is to move multi-batch support to an extension and here we will only have batch_addr (ie., support for 1 batch only).
I like the former one better (the one you suggested).
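A tiny userspace-side illustration of that convention; eb3, batch_va and n_batches are assumed locals, and the field names come from the provisional proposal earlier in the thread:

    if (n_batches == 1) {
            eb3.batch_count = 1;
            eb3.batch_addr_ptr = batch_va[0];            /* the batch GPU VA itself */
    } else {
            eb3.batch_count = n_batches;
            eb3.batch_addr_ptr = (uintptr_t)batch_va;    /* pointer to an array of GPU VAs */
    }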
Niranjana
On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
Does this sound reasonable?
Thanks for proposing this. Some comments below.
struct drm_i915_gem_execbuffer3 { __u32 ctx_id; /* previously execbuffer2.rsvd1 */
__u32 batch_count; __u64 batch_addr_ptr; /* Pointer to an array of batch gpu virtual addresses */
__u64 flags; #define I915_EXEC3_RING_MASK (0x3f) #define I915_EXEC3_DEFAULT (0<<0) #define I915_EXEC3_RENDER (1<<0) #define I915_EXEC3_BSD (2<<0) #define I915_EXEC3_BLT (3<<0) #define I915_EXEC3_VEBOX (4<<0)
Shouldn't we use the new engine selection uAPI instead?
We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in drm_i915_gem_context_create_ext_setparam.
And you can also create virtual engines with the same extension.
It feels like this could be a single u32 with the engine index (in the context engine map).
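A minimal sketch of that flow with the existing uapi: create a context with an explicit engine map, then refer to engines only by index. The 'engine_idx' field mentioned at the end is hypothetical (part of this RFC discussion); everything else is current i915_drm.h:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    static uint32_t create_ctx_with_engine_map(int drm_fd)
    {
            /* Engine map: index 0 = render, index 1 = copy engine. */
            struct {
                    __u64 extensions;
                    struct i915_engine_class_instance engines[2];
            } __attribute__((packed)) engine_map = {
                    .engines = {
                            { .engine_class = I915_ENGINE_CLASS_RENDER, .engine_instance = 0 },
                            { .engine_class = I915_ENGINE_CLASS_COPY,   .engine_instance = 0 },
                    },
            };

            struct drm_i915_gem_context_create_ext_setparam p_engines = {
                    .base  = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
                    .param = {
                            .param = I915_CONTEXT_PARAM_ENGINES,
                            .size  = sizeof(engine_map),
                            .value = (uintptr_t)&engine_map,
                    },
            };

            struct drm_i915_gem_context_create_ext create = {
                    .flags      = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
                    .extensions = (uintptr_t)&p_engines,
            };

            ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
            return create.ctx_id;
            /* An execbuf3 would then carry ctx_id plus e.g. engine_idx = 1
             * (hypothetical field) to submit to the copy engine. */
    }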
#define I915_EXEC3_SECURE (1<<6) #define I915_EXEC3_IS_PINNED (1<<7)
What's the meaning of PINNED?
#define I915_EXEC3_BSD_SHIFT (8) #define I915_EXEC3_BSD_MASK (3 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_DEFAULT (0 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING1 (1 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING2 (2 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_FENCE_IN (1<<10) #define I915_EXEC3_FENCE_OUT (1<<11)
For Mesa, as soon as we have DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES support, we only use that.
So there isn't much point for FENCE_IN/OUT.
Maybe check with other UMDs?
#define I915_EXEC3_FENCE_SUBMIT (1<<12)
What's FENCE_SUBMIT?
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
__u64 extensions; /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */ };
On 08/06/2022 09:40, Lionel Landwerlin wrote:
For Mesa, as soon as we have DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES support, we only use that.
So there isn't much point for FENCE_IN/OUT.
Maybe check with other UMDs?
Correcting myself a bit here :
- iris uses I915_EXEC_FENCE_ARRAY
- anv uses I915_EXEC_FENCE_ARRAY or DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES
In either case we could easily switch to DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES all the time.
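To make the "timeline fences only" option concrete, a sketch of attaching the existing extension. The structs below are current uapi; hanging the chain off an execbuf3 'extensions' field is the RFC part, and the syncobj handles/points are assumed to exist:

    struct drm_i915_gem_exec_fence fences[2] = {
            { .handle = wait_syncobj,   .flags = I915_EXEC_FENCE_WAIT },
            { .handle = signal_syncobj, .flags = I915_EXEC_FENCE_SIGNAL },
    };
    __u64 points[2] = { wait_point, signal_point };  /* 0 for a binary syncobj */

    struct drm_i915_gem_execbuffer_ext_timeline_fences ext = {
            .base.name   = DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES,
            .fence_count = 2,
            .handles_ptr = (uintptr_t)fences,
            .values_ptr  = (uintptr_t)points,
    };

    eb3.extensions = (uintptr_t)&ext;  /* eb3 = hypothetical struct drm_i915_gem_execbuffer3 */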
On 08/06/2022 07:40, Lionel Landwerlin wrote:
Shouldn't we use the new engine selection uAPI instead?
We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in drm_i915_gem_context_create_ext_setparam.
And you can also create virtual engines with the same extension.
It feels like this could be a single u32 with the engine index (in the context engine map).
Yes I said the same yesterday.
Also note that as you can't any longer set engines on a default context, question is whether userspace cares to use execbuf3 with it (default context).
If it does, it will need an alternative engine selection for that case. I was proposing class:instance rather than legacy cumbersome flags.
If it does not, I mean if the decision is to only allow execbuf3 with engine maps, then it leaves the default context a waste of kernel memory in the execbuf3 future. :( Don't know what to do there..
Regards,
Tvrtko
On 08/06/2022 11:36, Tvrtko Ursulin wrote:
On 08/06/2022 07:40, Lionel Landwerlin wrote:
On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote: > > On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote: > >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura > wrote: > >> VM_BIND and related uapi definitions > >> > >> v2: Ensure proper kernel-doc formatting with cross references. > >> Also add new uapi and documentation as per review comments > >> from Daniel. > >> > >> Signed-off-by: Niranjana Vishwanathapura > niranjana.vishwanathapura@intel.com > >> --- > >> Documentation/gpu/rfc/i915_vm_bind.h | 399 > +++++++++++++++++++++++++++ > >> 1 file changed, 399 insertions(+) > >> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h > >> > >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h > b/Documentation/gpu/rfc/i915_vm_bind.h > >> new file mode 100644 > >> index 000000000000..589c0a009107 > >> --- /dev/null > >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h > >> @@ -0,0 +1,399 @@ > >> +/* SPDX-License-Identifier: MIT */ > >> +/* > >> + * Copyright © 2022 Intel Corporation > >> + */ > >> + > >> +/** > >> + * DOC: I915_PARAM_HAS_VM_BIND > >> + * > >> + * VM_BIND feature availability. > >> + * See typedef drm_i915_getparam_t param. > >> + */ > >> +#define I915_PARAM_HAS_VM_BIND 57 > >> + > >> +/** > >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND > >> + * > >> + * Flag to opt-in for VM_BIND mode of binding during VM > creation. > >> + * See struct drm_i915_gem_vm_control flags. > >> + * > >> + * A VM in VM_BIND mode will not support the older execbuff > mode of binding. > >> + * In VM_BIND mode, execbuff ioctl will not accept any > execlist (ie., the > >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). > >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and > >> + * &drm_i915_gem_execbuffer2.batch_len must be 0. > >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension > must be provided > >> + * to pass in the batch buffer addresses. > >> + * > >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and > >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags > must be 0 > >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag > must always be > >> + * set (See struct > drm_i915_gem_execbuffer_ext_batch_addresses). > >> + * The buffers_ptr, buffer_count, batch_start_offset and > batch_len fields > >> + * of struct drm_i915_gem_execbuffer2 are also not used and > must be 0. > >> + */ > > > >From that description, it seems we have: > > > >struct drm_i915_gem_execbuffer2 { > > __u64 buffers_ptr; -> must be 0 (new) > > __u32 buffer_count; -> must be 0 (new) > > __u32 batch_start_offset; -> must be 0 (new) > > __u32 batch_len; -> must be 0 (new) > > __u32 DR1; -> must be 0 (old) > > __u32 DR4; -> must be 0 (old) > > __u32 num_cliprects; (fences) -> must be 0 since > using extensions > > __u64 cliprects_ptr; (fences, extensions) -> contains > an actual pointer! > > __u64 flags; -> some flags must be 0 > (new) > > __u64 rsvd1; (context info) -> repurposed field (old) > > __u64 rsvd2; -> unused > >}; > > > >Based on that, why can't we just get drm_i915_gem_execbuffer3 > instead > >of adding even more complexity to an already abused interface? > While > >the Vulkan-like extension thing is really nice, I don't think what > >we're doing here is extending the ioctl usage, we're completely > >changing how the base struct should be interpreted based on how > the VM > >was created (which is an entirely different ioctl). 
> > > >From Rusty Russel's API Design grading, > drm_i915_gem_execbuffer2 is > >already at -6 without these changes. I think after vm_bind > we'll need > >to create a -11 entry just to deal with this ioctl. > > > > The only change here is removing the execlist support for VM_BIND > mode (other than natual extensions). > Adding a new execbuffer3 was considered, but I think we need to > be careful > with that as that goes beyond the VM_BIND support, including any > future > requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
Thanks for proposing this. Some comments below.
struct drm_i915_gem_execbuffer3 { __u32 ctx_id; /* previously execbuffer2.rsvd1 */
__u32 batch_count; __u64 batch_addr_ptr; /* Pointer to an array of batch gpu virtual addresses */
__u64 flags; #define I915_EXEC3_RING_MASK (0x3f) #define I915_EXEC3_DEFAULT (0<<0) #define I915_EXEC3_RENDER (1<<0) #define I915_EXEC3_BSD (2<<0) #define I915_EXEC3_BLT (3<<0) #define I915_EXEC3_VEBOX (4<<0)
Shouldn't we use the new engine selection uAPI instead?
We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in drm_i915_gem_context_create_ext_setparam.
And you can also create virtual engines with the same extension.
It feels like this could be a single u32 with the engine index (in the context engine map).
Yes I said the same yesterday.
Also note that as you can't any longer set engines on a default context, question is whether userspace cares to use execbuf3 with it (default context).
If it does, it will need an alternative engine selection for that case. I was proposing class:instance rather than legacy cumbersome flags.
If it does not, I mean if the decision is to only allow execbuf3 with engine maps, then it leaves the default context a waste of kernel memory in the execbuf3 future. :( Don't know what to do there..
Regards,
Tvrtko
Thanks Tvrtko, I only saw your reply after responding.
Both Iris & Anv create a context with engines (if kernel supports it) : https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/common/intel_...
I think we should be fine with just a single engine id and we don't care about the default context.
-Lionel
On 08/06/2022 09:45, Lionel Landwerlin wrote:
On 08/06/2022 11:36, Tvrtko Ursulin wrote:
On 08/06/2022 07:40, Lionel Landwerlin wrote:
On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote: > > On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura > niranjana.vishwanathapura@intel.com wrote: >> >> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote: >> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura >> wrote: >> >> VM_BIND and related uapi definitions >> >> >> >> v2: Ensure proper kernel-doc formatting with cross references. >> >> Also add new uapi and documentation as per review comments >> >> from Daniel. >> >> >> >> Signed-off-by: Niranjana Vishwanathapura >> niranjana.vishwanathapura@intel.com >> >> --- >> >> Documentation/gpu/rfc/i915_vm_bind.h | 399 >> +++++++++++++++++++++++++++ >> >> 1 file changed, 399 insertions(+) >> >> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h >> >> >> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h >> b/Documentation/gpu/rfc/i915_vm_bind.h >> >> new file mode 100644 >> >> index 000000000000..589c0a009107 >> >> --- /dev/null >> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h >> >> @@ -0,0 +1,399 @@ >> >> +/* SPDX-License-Identifier: MIT */ >> >> +/* >> >> + * Copyright © 2022 Intel Corporation >> >> + */ >> >> + >> >> +/** >> >> + * DOC: I915_PARAM_HAS_VM_BIND >> >> + * >> >> + * VM_BIND feature availability. >> >> + * See typedef drm_i915_getparam_t param. >> >> + */ >> >> +#define I915_PARAM_HAS_VM_BIND 57 >> >> + >> >> +/** >> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND >> >> + * >> >> + * Flag to opt-in for VM_BIND mode of binding during VM >> creation. >> >> + * See struct drm_i915_gem_vm_control flags. >> >> + * >> >> + * A VM in VM_BIND mode will not support the older execbuff >> mode of binding. >> >> + * In VM_BIND mode, execbuff ioctl will not accept any >> execlist (ie., the >> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). >> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and >> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0. >> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension >> must be provided >> >> + * to pass in the batch buffer addresses. >> >> + * >> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and >> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags >> must be 0 >> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag >> must always be >> >> + * set (See struct >> drm_i915_gem_execbuffer_ext_batch_addresses). >> >> + * The buffers_ptr, buffer_count, batch_start_offset and >> batch_len fields >> >> + * of struct drm_i915_gem_execbuffer2 are also not used and >> must be 0. >> >> + */ >> > >> >From that description, it seems we have: >> > >> >struct drm_i915_gem_execbuffer2 { >> > __u64 buffers_ptr; -> must be 0 (new) >> > __u32 buffer_count; -> must be 0 (new) >> > __u32 batch_start_offset; -> must be 0 (new) >> > __u32 batch_len; -> must be 0 (new) >> > __u32 DR1; -> must be 0 (old) >> > __u32 DR4; -> must be 0 (old) >> > __u32 num_cliprects; (fences) -> must be 0 since >> using extensions >> > __u64 cliprects_ptr; (fences, extensions) -> contains >> an actual pointer! >> > __u64 flags; -> some flags must be 0 >> (new) >> > __u64 rsvd1; (context info) -> repurposed field (old) >> > __u64 rsvd2; -> unused >> >}; >> > >> >Based on that, why can't we just get drm_i915_gem_execbuffer3 >> instead >> >of adding even more complexity to an already abused interface? 
>> While >> >the Vulkan-like extension thing is really nice, I don't think what >> >we're doing here is extending the ioctl usage, we're completely >> >changing how the base struct should be interpreted based on how >> the VM >> >was created (which is an entirely different ioctl). >> > >> >From Rusty Russel's API Design grading, >> drm_i915_gem_execbuffer2 is >> >already at -6 without these changes. I think after vm_bind >> we'll need >> >to create a -11 entry just to deal with this ioctl. >> > >> >> The only change here is removing the execlist support for VM_BIND >> mode (other than natual extensions). >> Adding a new execbuffer3 was considered, but I think we need to >> be careful >> with that as that goes beyond the VM_BIND support, including any >> future >> requirements (as we don't want an execbuffer4 after VM_BIND). > > Why not? it's not like adding extensions here is really that > different > than adding new ioctls. > > I definitely think this deserves an execbuffer3 without even > considering future requirements. Just to burn down the old > requirements and pointless fields. > > Make execbuffer3 be vm bind only, no relocs, no legacy bits, > leave the > older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
Thanks for proposing this. Some comments below.
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;          /* previously execbuffer2.rsvd1 */

        __u32 batch_count;
        __u64 batch_addr_ptr;  /* Pointer to an array of batch gpu virtual addresses */

        __u64 flags;
#define I915_EXEC3_RING_MASK   (0x3f)
#define I915_EXEC3_DEFAULT     (0<<0)
#define I915_EXEC3_RENDER      (1<<0)
#define I915_EXEC3_BSD         (2<<0)
#define I915_EXEC3_BLT         (3<<0)
#define I915_EXEC3_VEBOX       (4<<0)
Shouldn't we use the new engine selection uAPI instead?
We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in drm_i915_gem_context_create_ext_setparam.
And you can also create virtual engines with the same extension.
It feels like this could be a single u32 with the engine index (in the context engine map).
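For reference, a minimal sketch (not part of the proposal itself) of creating a context with a user engine map through the existing I915_CONTEXT_PARAM_ENGINES uapi; the single render engine and missing error handling are just to keep it short:

        #include <stdint.h>
        #include <sys/ioctl.h>
        #include <drm/i915_drm.h>

        /* Sketch: create a context whose engine map has one render engine, so an
         * execbuf3-style submission could select it as engine index 0. */
        static uint32_t create_ctx_with_engine_map(int drm_fd)
        {
                I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
                        .engines = {
                                { .engine_class = I915_ENGINE_CLASS_RENDER,
                                  .engine_instance = 0 },
                        },
                };
                struct drm_i915_gem_context_create_ext_setparam set_engines = {
                        .base = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
                        .param = {
                                .param = I915_CONTEXT_PARAM_ENGINES,
                                .value = (uintptr_t)&engines,
                                .size = sizeof(engines),
                        },
                };
                struct drm_i915_gem_context_create_ext create = {
                        .flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
                        .extensions = (uintptr_t)&set_engines,
                };

                ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
                return create.ctx_id;
        }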
Yes I said the same yesterday.
Also note that since you can no longer set engines on a default context, the question is whether userspace cares to use execbuf3 with it (the default context).
If it does, it will need an alternative engine selection for that case. I was proposing class:instance rather than legacy cumbersome flags.
If it does not, I mean if the decision is to only allow execbuf3 with engine maps, then it leaves the default context a waste of kernel memory in the execbuf3 future. :( Don't know what to do there..
Regards,
Tvrtko
Thanks Tvrtko, I only saw your reply after responding.
Both Iris & Anv create a context with engines (if kernel supports it) : https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/common/intel_...
I think we should be fine with just a single engine id and we don't care about the default context.
I wonder if in this case we could stop creating the default context starting from a future "gen"? Otherwise, with engine-map-only execbuf3 and execbuf3-only userspace, it would serve no purpose apart from wasting kernel memory.
Regards,
Tvrtko
-Lionel
#define I915_EXEC3_SECURE      (1<<6)
#define I915_EXEC3_IS_PINNED   (1<<7)
What's the meaning of PINNED?
#define I915_EXEC3_BSD_SHIFT   (8)
#define I915_EXEC3_BSD_MASK    (3 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_DEFAULT (0 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING1   (1 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING2   (2 << I915_EXEC3_BSD_SHIFT)

#define I915_EXEC3_FENCE_IN    (1<<10)
#define I915_EXEC3_FENCE_OUT   (1<<11)
For Mesa, as soon as we have DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES support, we only use that.
So there isn't much point for FENCE_IN/OUT.
Maybe check with other UMDs?
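For reference, this is roughly how the existing timeline-fences extension is populated (a sketch only; the syncobj handles and points below are placeholders):

        /* Sketch: one timeline syncobj to wait on and one to signal, attached to a
         * submission through the existing timeline-fences extension. */
        __u32 wait_syncobj = 0, signal_syncobj = 0;     /* placeholder handles */
        __u64 wait_point = 1, signal_point = 2;         /* placeholder points */

        struct drm_i915_gem_exec_fence handles[2] = {
                { .handle = wait_syncobj,   .flags = I915_EXEC_FENCE_WAIT },
                { .handle = signal_syncobj, .flags = I915_EXEC_FENCE_SIGNAL },
        };
        __u64 points[2] = { wait_point, signal_point };

        struct drm_i915_gem_execbuffer_ext_timeline_fences fences = {
                .base = { .name = DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES },
                .fence_count = 2,
                .handles_ptr = (uintptr_t)handles,
                .values_ptr  = (uintptr_t)points,
        };
        /* 'fences' is then chained into the submission's extensions list. */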
#define I915_EXEC3_FENCE_SUBMIT (1<<12)
What's FENCE_SUBMIT?
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
        __u64 extensions;      /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
};
With this, the user can pass in the batch addresses and count directly, instead of as an extension (as this RFC series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Anything else that needs to be added or removed?
Niranjana
Niranjana
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Jun 08, 2022 at 09:54:24AM +0100, Tvrtko Ursulin wrote:
On 08/06/2022 09:45, Lionel Landwerlin wrote:
On 08/06/2022 11:36, Tvrtko Ursulin wrote:
On 08/06/2022 07:40, Lionel Landwerlin wrote:
On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote: >On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote: >> >>On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura >>niranjana.vishwanathapura@intel.com wrote: >>> >>>On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote: >>>>On Tue, 2022-05-17 at 11:32 -0700, Niranjana >>>Vishwanathapura wrote: >>>>> VM_BIND and related uapi definitions >>>>> >>>>> v2: Ensure proper kernel-doc formatting with cross references. >>>>> Also add new uapi and documentation as per review comments >>>>> from Daniel. >>>>> >>>>> Signed-off-by: Niranjana Vishwanathapura >>>niranjana.vishwanathapura@intel.com >>>>> --- >>>>> Documentation/gpu/rfc/i915_vm_bind.h | 399 >>>+++++++++++++++++++++++++++ >>>>> 1 file changed, 399 insertions(+) >>>>> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h >>>>> >>>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h >>>b/Documentation/gpu/rfc/i915_vm_bind.h >>>>> new file mode 100644 >>>>> index 000000000000..589c0a009107 >>>>> --- /dev/null >>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h >>>>> @@ -0,0 +1,399 @@ >>>>> +/* SPDX-License-Identifier: MIT */ >>>>> +/* >>>>> + * Copyright © 2022 Intel Corporation >>>>> + */ >>>>> + >>>>> +/** >>>>> + * DOC: I915_PARAM_HAS_VM_BIND >>>>> + * >>>>> + * VM_BIND feature availability. >>>>> + * See typedef drm_i915_getparam_t param. >>>>> + */ >>>>> +#define I915_PARAM_HAS_VM_BIND 57 >>>>> + >>>>> +/** >>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND >>>>> + * >>>>> + * Flag to opt-in for VM_BIND mode of binding >>>during VM creation. >>>>> + * See struct drm_i915_gem_vm_control flags. >>>>> + * >>>>> + * A VM in VM_BIND mode will not support the older >>>execbuff mode of binding. >>>>> + * In VM_BIND mode, execbuff ioctl will not accept >>>any execlist (ie., the >>>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). >>>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and >>>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0. >>>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES >>>extension must be provided >>>>> + * to pass in the batch buffer addresses. >>>>> + * >>>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and >>>>> + * I915_EXEC_BATCH_FIRST of >>>&drm_i915_gem_execbuffer2.flags must be 0 >>>>> + * (not used) in VM_BIND mode. >>>I915_EXEC_USE_EXTENSIONS flag must always be >>>>> + * set (See struct >>>drm_i915_gem_execbuffer_ext_batch_addresses). >>>>> + * The buffers_ptr, buffer_count, >>>batch_start_offset and batch_len fields >>>>> + * of struct drm_i915_gem_execbuffer2 are also not >>>used and must be 0. >>>>> + */ >>>> >>>>From that description, it seems we have: >>>> >>>>struct drm_i915_gem_execbuffer2 { >>>> __u64 buffers_ptr; -> must be 0 (new) >>>> __u32 buffer_count; -> must be 0 (new) >>>> __u32 batch_start_offset; -> must be 0 (new) >>>> __u32 batch_len; -> must be 0 (new) >>>> __u32 DR1; -> must be 0 (old) >>>> __u32 DR4; -> must be 0 (old) >>>> __u32 num_cliprects; (fences) -> must be 0 >>>since using extensions >>>> __u64 cliprects_ptr; (fences, extensions) -> >>>contains an actual pointer! >>>> __u64 flags; -> some flags >>>must be 0 (new) >>>> __u64 rsvd1; (context info) -> repurposed field (old) >>>> __u64 rsvd2; -> unused >>>>}; >>>> >>>>Based on that, why can't we just get >>>drm_i915_gem_execbuffer3 instead >>>>of adding even more complexity to an already abused >>>interface? 
While >>>>the Vulkan-like extension thing is really nice, I don't think what >>>>we're doing here is extending the ioctl usage, we're completely >>>>changing how the base struct should be interpreted >>>based on how the VM >>>>was created (which is an entirely different ioctl). >>>> >>>>From Rusty Russel's API Design grading, >>>drm_i915_gem_execbuffer2 is >>>>already at -6 without these changes. I think after >>>vm_bind we'll need >>>>to create a -11 entry just to deal with this ioctl. >>>> >>> >>>The only change here is removing the execlist support for VM_BIND >>>mode (other than natual extensions). >>>Adding a new execbuffer3 was considered, but I think >>>we need to be careful >>>with that as that goes beyond the VM_BIND support, >>>including any future >>>requirements (as we don't want an execbuffer4 after VM_BIND). >> >>Why not? it's not like adding extensions here is really >>that different >>than adding new ioctls. >> >>I definitely think this deserves an execbuffer3 without even >>considering future requirements. Just to burn down the old >>requirements and pointless fields. >> >>Make execbuffer3 be vm bind only, no relocs, no legacy >>bits, leave the >>older sw on execbuf2 for ever. > >I guess another point in favour of execbuf3 would be that it's less >midlayer. If we share the entry point then there's quite a few vfuncs >needed to cleanly split out the vm_bind paths from the legacy >reloc/softping paths. > >If we invert this and do execbuf3, then there's the existing ioctl >vfunc, and then we share code (where it even makes sense, probably >request setup/submit need to be shared, anything else is probably >cleaner to just copypaste) with the usual helper approach. > >Also that would guarantee that really none of the old concepts like >i915_active on the vma or vma open counts and all that stuff leaks >into the new vm_bind execbuf. > >Finally I also think that copypasting would make backporting easier, >or at least more flexible, since it should make it easier to have the >upstream vm_bind co-exist with all the other things we have. Without >huge amounts of conflicts (or at least much less) that pushing a pile >of vfuncs into the existing code would cause. > >So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
Thanks for proposing this. Some comments below.
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;          /* previously execbuffer2.rsvd1 */

        __u32 batch_count;
        __u64 batch_addr_ptr;  /* Pointer to an array of batch gpu virtual addresses */

        __u64 flags;
#define I915_EXEC3_RING_MASK   (0x3f)
#define I915_EXEC3_DEFAULT     (0<<0)
#define I915_EXEC3_RENDER      (1<<0)
#define I915_EXEC3_BSD         (2<<0)
#define I915_EXEC3_BLT         (3<<0)
#define I915_EXEC3_VEBOX       (4<<0)
Shouldn't we use the new engine selection uAPI instead?
We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in drm_i915_gem_context_create_ext_setparam.
And you can also create virtual engines with the same extension.
It feels like this could be a single u32 with the engine index (in the context engine map).
Yes I said the same yesterday.
Also note that as you can't any longer set engines on a default context, question is whether userspace cares to use execbuf3 with it (default context).
If it does, it will need an alternative engine selection for that case. I was proposing class:instance rather than legacy cumbersome flags.
If it does not, I mean if the decision is to only allow execbuf3 with engine maps, then it leaves the default context a waste of kernel memory in the execbuf3 future. :( Don't know what to do there..
Regards,
Tvrtko
Thanks Tvrtko, I only saw your reply after responding.
Both Iris & Anv create a context with engines (if kernel supports it) : https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/common/intel_...
I think we should be fine with just a single engine id and we don't care about the default context.
I wonder if in this case we could stop creating the default context starting from a future "gen"? Otherwise, with engine map only execbuf3 and execbuf3 only userspace, it would serve no purpose apart from wasting kernel memory.
Thanks Tvrtko, Lionel.
I will be glad to remove these flags and just define a __u32 engine_id, mandating a context with a user engine map.
Regarding removing the default context: yeah, it depends on from which gen onwards we will only be supporting execbuf3 and execbuf2 is fully deprecated. Till then, we will have to keep it I guess :(.
Regards,
Tvrtko
-Lionel
#define I915_EXEC3_SECURE      (1<<6)
#define I915_EXEC3_IS_PINNED   (1<<7)
What's the meaning of PINNED?
This turned out to be a legacy use case. Will remove it. execbuf3 will anyway only be supported when HAS_VM_BIND is true.
#define I915_EXEC3_BSD_SHIFT   (8)
#define I915_EXEC3_BSD_MASK    (3 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_DEFAULT (0 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING1   (1 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING2   (2 << I915_EXEC3_BSD_SHIFT)

#define I915_EXEC3_FENCE_IN    (1<<10)
#define I915_EXEC3_FENCE_OUT   (1<<11)
For Mesa, as soon as we have DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES support, we only use that.
So there isn't much point for FENCE_IN/OUT.
Maybe check with other UMDs?
Thanks, will remove it if other UMDs do not ask for it.
#define I915_EXEC3_FENCE_SUBMIT (1<<12)
What's FENCE_SUBMIT?
This seems to be a mechanism to align request submissions together. As per Tvrtko, a generic mechanism to align submissions was rejected. So, if UMDs don't need it, we can remove it.
So, execbuf3 would look like (if all UMDS agree),
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;          /* previously execbuffer2.rsvd1 */
        __u32 engine_id;       /* previously 'execbuffer2.flags & I915_EXEC_RING_MASK' */

        __u32 rsvd1;           /* Reserved */
        __u32 batch_count;
        /* batch VA if batch_count=1, otherwise a pointer to an array of batch VAs */
        __u64 batch_address;

        __u64 flags;
#define I915_EXEC3_SECURE      (1<<0)

        __u64 rsvd2;           /* Reserved */
        __u64 extensions;      /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
};
Also, I wondered whether we need to put the timeline fences in an extension or directly in the drm_i915_gem_execbuffer3 struct. I prefer putting them in an extension if they are not specified for all execbuf calls. Any thoughts?
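Purely as an illustration of the proposal above (drm_i915_gem_execbuffer3 does not exist yet, so the ioctl macro below is hypothetical), userspace usage could then look roughly like:

        /* Sketch against the proposed struct above; all values are placeholders. */
        __u32 ctx_id = 0;                       /* context created with an engine map */
        __u64 batch_va = 0x100000;              /* GPU VA previously bound with vm_bind */

        struct drm_i915_gem_execbuffer3 execbuf = {
                .ctx_id = ctx_id,
                .engine_id = 0,                 /* index into the context engine map */
                .batch_count = 1,
                .batch_address = batch_va,      /* single VA, no array indirection */
                .flags = 0,
                .extensions = 0,                /* e.g. timeline fences, if needed */
        };
        /* ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER3, &execbuf);  hypothetical */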
Niranjana
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
__u64 extensions; /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */ };
With this, user can pass in batch addresses and count directly, instead of as an extension (as this rfc series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Any thing else needs to be added or removed?
Niranjana
Niranjana
>-Daniel >-- >Daniel Vetter >Software Engineer, Intel Corporation >http://blog.ffwll.ch
On 08/06/2022 21:45, Niranjana Vishwanathapura wrote:
On Wed, Jun 08, 2022 at 09:54:24AM +0100, Tvrtko Ursulin wrote:
On 08/06/2022 09:45, Lionel Landwerlin wrote:
On 08/06/2022 11:36, Tvrtko Ursulin wrote:
On 08/06/2022 07:40, Lionel Landwerlin wrote:
On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote: > On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote: >> On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote: >>> >>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura >>> niranjana.vishwanathapura@intel.com wrote: >>>> >>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote: >>>>> On Tue, 2022-05-17 at 11:32 -0700, Niranjana >>>> Vishwanathapura wrote: >>>>>> VM_BIND and related uapi definitions >>>>>> >>>>>> v2: Ensure proper kernel-doc formatting with cross references. >>>>>> Also add new uapi and documentation as per review comments >>>>>> from Daniel. >>>>>> >>>>>> Signed-off-by: Niranjana Vishwanathapura >>>> niranjana.vishwanathapura@intel.com >>>>>> --- >>>>>> Documentation/gpu/rfc/i915_vm_bind.h | 399 >>>> +++++++++++++++++++++++++++ >>>>>> 1 file changed, 399 insertions(+) >>>>>> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h >>>>>> >>>>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h >>>> b/Documentation/gpu/rfc/i915_vm_bind.h >>>>>> new file mode 100644 >>>>>> index 000000000000..589c0a009107 >>>>>> --- /dev/null >>>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h >>>>>> @@ -0,0 +1,399 @@ >>>>>> +/* SPDX-License-Identifier: MIT */ >>>>>> +/* >>>>>> + * Copyright © 2022 Intel Corporation >>>>>> + */ >>>>>> + >>>>>> +/** >>>>>> + * DOC: I915_PARAM_HAS_VM_BIND >>>>>> + * >>>>>> + * VM_BIND feature availability. >>>>>> + * See typedef drm_i915_getparam_t param. >>>>>> + */ >>>>>> +#define I915_PARAM_HAS_VM_BIND 57 >>>>>> + >>>>>> +/** >>>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND >>>>>> + * >>>>>> + * Flag to opt-in for VM_BIND mode of binding >>>> during VM creation. >>>>>> + * See struct drm_i915_gem_vm_control flags. >>>>>> + * >>>>>> + * A VM in VM_BIND mode will not support the older >>>> execbuff mode of binding. >>>>>> + * In VM_BIND mode, execbuff ioctl will not accept >>>> any execlist (ie., the >>>>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). >>>>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and >>>>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0. >>>>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES >>>> extension must be provided >>>>>> + * to pass in the batch buffer addresses. >>>>>> + * >>>>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and >>>>>> + * I915_EXEC_BATCH_FIRST of >>>> &drm_i915_gem_execbuffer2.flags must be 0 >>>>>> + * (not used) in VM_BIND mode. >>>> I915_EXEC_USE_EXTENSIONS flag must always be >>>>>> + * set (See struct >>>> drm_i915_gem_execbuffer_ext_batch_addresses). >>>>>> + * The buffers_ptr, buffer_count, >>>> batch_start_offset and batch_len fields >>>>>> + * of struct drm_i915_gem_execbuffer2 are also not >>>> used and must be 0. >>>>>> + */ >>>>> >>>>> From that description, it seems we have: >>>>> >>>>> struct drm_i915_gem_execbuffer2 { >>>>> __u64 buffers_ptr; -> must be 0 (new) >>>>> __u32 buffer_count; -> must be 0 (new) >>>>> __u32 batch_start_offset; -> must be 0 (new) >>>>> __u32 batch_len; -> must be 0 (new) >>>>> __u32 DR1; -> must be 0 (old) >>>>> __u32 DR4; -> must be 0 (old) >>>>> __u32 num_cliprects; (fences) -> must be 0 >>>> since using extensions >>>>> __u64 cliprects_ptr; (fences, extensions) -> >>>> contains an actual pointer! 
>>>>> __u64 flags; -> some flags >>>> must be 0 (new) >>>>> __u64 rsvd1; (context info) -> repurposed field >>>>> (old) >>>>> __u64 rsvd2; -> unused >>>>> }; >>>>> >>>>> Based on that, why can't we just get >>>> drm_i915_gem_execbuffer3 instead >>>>> of adding even more complexity to an already abused >>>> interface? While >>>>> the Vulkan-like extension thing is really nice, I don't think >>>>> what >>>>> we're doing here is extending the ioctl usage, we're completely >>>>> changing how the base struct should be interpreted >>>> based on how the VM >>>>> was created (which is an entirely different ioctl). >>>>> >>>>> From Rusty Russel's API Design grading, >>>> drm_i915_gem_execbuffer2 is >>>>> already at -6 without these changes. I think after >>>> vm_bind we'll need >>>>> to create a -11 entry just to deal with this ioctl. >>>>> >>>> >>>> The only change here is removing the execlist support for VM_BIND >>>> mode (other than natual extensions). >>>> Adding a new execbuffer3 was considered, but I think we need >>>> to be careful >>>> with that as that goes beyond the VM_BIND support, including >>>> any future >>>> requirements (as we don't want an execbuffer4 after VM_BIND). >>> >>> Why not? it's not like adding extensions here is really that >>> different >>> than adding new ioctls. >>> >>> I definitely think this deserves an execbuffer3 without even >>> considering future requirements. Just to burn down the old >>> requirements and pointless fields. >>> >>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, >>> leave the >>> older sw on execbuf2 for ever. >> >> I guess another point in favour of execbuf3 would be that it's less >> midlayer. If we share the entry point then there's quite a few >> vfuncs >> needed to cleanly split out the vm_bind paths from the legacy >> reloc/softping paths. >> >> If we invert this and do execbuf3, then there's the existing ioctl >> vfunc, and then we share code (where it even makes sense, probably >> request setup/submit need to be shared, anything else is probably >> cleaner to just copypaste) with the usual helper approach. >> >> Also that would guarantee that really none of the old concepts like >> i915_active on the vma or vma open counts and all that stuff leaks >> into the new vm_bind execbuf. >> >> Finally I also think that copypasting would make backporting >> easier, >> or at least more flexible, since it should make it easier to >> have the >> upstream vm_bind co-exist with all the other things we have. >> Without >> huge amounts of conflicts (or at least much less) that pushing a >> pile >> of vfuncs into the existing code would cause. >> >> So maybe we should do this? > > Thanks Dave, Daniel. > There are a few things that will be common between execbuf2 and > execbuf3, like request setup/submit (as you said), fence handling > (timeline fences, fence array, composite fences), engine selection, > etc. Also, many of the 'flags' will be there in execbuf3 also (but > bit position will differ). > But I guess these should be fine as the suggestion here is to > copy-paste the execbuff code and having a shared code where > possible. > Besides, we can stop supporting some older feature in execbuff3 > (like fence array in favor of newer timeline fences), which will > further reduce common code. > > Ok, I will update this series by adding execbuf3 and send out soon. >
Does this sound reasonable?
Thanks for proposing this. Some comments below.
struct drm_i915_gem_execbuffer3 { __u32 ctx_id; /* previously execbuffer2.rsvd1 */
__u32 batch_count; __u64 batch_addr_ptr; /* Pointer to an array of batch gpu virtual addresses */
__u64 flags; #define I915_EXEC3_RING_MASK (0x3f) #define I915_EXEC3_DEFAULT (0<<0) #define I915_EXEC3_RENDER (1<<0) #define I915_EXEC3_BSD (2<<0) #define I915_EXEC3_BLT (3<<0) #define I915_EXEC3_VEBOX (4<<0)
Shouldn't we use the new engine selection uAPI instead?
We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in drm_i915_gem_context_create_ext_setparam.
And you can also create virtual engines with the same extension.
It feels like this could be a single u32 with the engine index (in the context engine map).
Yes I said the same yesterday.
Also note that as you can't any longer set engines on a default context, question is whether userspace cares to use execbuf3 with it (default context).
If it does, it will need an alternative engine selection for that case. I was proposing class:instance rather than legacy cumbersome flags.
If it does not, I mean if the decision is to only allow execbuf3 with engine maps, then it leaves the default context a waste of kernel memory in the execbuf3 future. :( Don't know what to do there..
Regards,
Tvrtko
Thanks Tvrtko, I only saw your reply after responding.
Both Iris & Anv create a context with engines (if kernel supports it) : https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/common/intel_...
I think we should be fine with just a single engine id and we don't care about the default context.
I wonder if in this case we could stop creating the default context starting from a future "gen"? Otherwise, with engine map only execbuf3 and execbuf3 only userspace, it would serve no purpose apart from wasting kernel memory.
Thanks Tvrtko, Lionel.
I will be glad to remove these flags, just define a uint32 engine_id and mandate a context with user engines map.
Regarding removing the default context, yah, it depends on from which gen onwards we will only be supporting execbuf3 and execbuf2 is fully deprecated. Till then, we will have to keep it I guess :(.
Forgot about this sub-thread... I think it could be removed before execbuf2 is fully deprecated. We can make that decision with any new platform which needs UMD stack updates to be supported. But it is work for us to adjust IGT, so I am not hopeful anyone will tackle it. We will just end up wasting memory.
Regards,
Tvrtko
On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote: > VM_BIND and related uapi definitions > > v2: Ensure proper kernel-doc formatting with cross references. > Also add new uapi and documentation as per review comments > from Daniel. > > Signed-off-by: Niranjana Vishwanathapura
niranjana.vishwanathapura@intel.com
> --- > Documentation/gpu/rfc/i915_vm_bind.h | 399
+++++++++++++++++++++++++++
> 1 file changed, 399 insertions(+) > create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h > > diff --git a/Documentation/gpu/rfc/i915_vm_bind.h
b/Documentation/gpu/rfc/i915_vm_bind.h
> new file mode 100644 > index 000000000000..589c0a009107 > --- /dev/null > +++ b/Documentation/gpu/rfc/i915_vm_bind.h > @@ -0,0 +1,399 @@ > +/* SPDX-License-Identifier: MIT */ > +/* > + * Copyright © 2022 Intel Corporation > + */ > + > +/** > + * DOC: I915_PARAM_HAS_VM_BIND > + * > + * VM_BIND feature availability. > + * See typedef drm_i915_getparam_t param. > + */ > +#define I915_PARAM_HAS_VM_BIND 57 > + > +/** > + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND > + * > + * Flag to opt-in for VM_BIND mode of binding during VM creation. > + * See struct drm_i915_gem_vm_control flags. > + * > + * A VM in VM_BIND mode will not support the older execbuff
mode of binding.
> + * In VM_BIND mode, execbuff ioctl will not accept any
execlist (ie., the
> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). > + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and > + * &drm_i915_gem_execbuffer2.batch_len must be 0. > + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must
be provided
> + * to pass in the batch buffer addresses. > + * > + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and > + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags
must be 0
> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag
must always be
> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses). > + * The buffers_ptr, buffer_count, batch_start_offset and
batch_len fields
> + * of struct drm_i915_gem_execbuffer2 are also not used and
must be 0.
> + */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 { __u64 buffers_ptr; -> must be 0 (new) __u32 buffer_count; -> must be 0 (new) __u32 batch_start_offset; -> must be 0 (new) __u32 batch_len; -> must be 0 (new) __u32 DR1; -> must be 0 (old) __u32 DR4; -> must be 0 (old) __u32 num_cliprects; (fences) -> must be 0 since using
extensions
__u64 cliprects_ptr; (fences, extensions) -> contains an
actual pointer!
__u64 flags; -> some flags must be 0
(new)
__u64 rsvd1; (context info) -> repurposed field (old) __u64 rsvd2; -> unused };
Based on that, why can't we just get drm_i915_gem_execbuffer3
instead
of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how
the VM
was created (which is an entirely different ioctl).
From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll
need
to create a -11 entry just to deal with this ioctl.
The only change here is removing the execlist support for VM_BIND mode (other than natual extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;          /* previously execbuffer2.rsvd1 */

        __u32 batch_count;
        __u64 batch_addr_ptr;  /* Pointer to an array of batch gpu virtual addresses */
Quick question raised on IRC about the batches: Are multiple batches limited to virtual engines?
Thanks,
-Lionel
__u64 flags; #define I915_EXEC3_RING_MASK (0x3f) #define I915_EXEC3_DEFAULT (0<<0) #define I915_EXEC3_RENDER (1<<0) #define I915_EXEC3_BSD (2<<0) #define I915_EXEC3_BLT (3<<0) #define I915_EXEC3_VEBOX (4<<0)
#define I915_EXEC3_SECURE (1<<6) #define I915_EXEC3_IS_PINNED (1<<7)
#define I915_EXEC3_BSD_SHIFT (8) #define I915_EXEC3_BSD_MASK (3 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_DEFAULT (0 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING1 (1 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING2 (2 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_FENCE_IN (1<<10) #define I915_EXEC3_FENCE_OUT (1<<11) #define I915_EXEC3_FENCE_SUBMIT (1<<12)
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
__u64 extensions; /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */ };
With this, user can pass in batch addresses and count directly, instead of as an extension (as this rfc series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Any thing else needs to be added or removed?
Niranjana
Niranjana
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Jun 08, 2022 at 10:12:45AM +0300, Lionel Landwerlin wrote:
On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote: >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote: >> VM_BIND and related uapi definitions >> >> v2: Ensure proper kernel-doc formatting with cross references. >> Also add new uapi and documentation as per review comments >> from Daniel. >> >> Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com >> --- >> Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ >> 1 file changed, 399 insertions(+) >> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h >> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h >> new file mode 100644 >> index 000000000000..589c0a009107 >> --- /dev/null >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h >> @@ -0,0 +1,399 @@ >> +/* SPDX-License-Identifier: MIT */ >> +/* >> + * Copyright © 2022 Intel Corporation >> + */ >> + >> +/** >> + * DOC: I915_PARAM_HAS_VM_BIND >> + * >> + * VM_BIND feature availability. >> + * See typedef drm_i915_getparam_t param. >> + */ >> +#define I915_PARAM_HAS_VM_BIND 57 >> + >> +/** >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND >> + * >> + * Flag to opt-in for VM_BIND mode of binding during VM creation. >> + * See struct drm_i915_gem_vm_control flags. >> + * >> + * A VM in VM_BIND mode will not support the older execbuff mode of binding. >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and >> + * &drm_i915_gem_execbuffer2.batch_len must be 0. >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided >> + * to pass in the batch buffer addresses. >> + * >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0 >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses). >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields >> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0. >> + */ > >From that description, it seems we have: > >struct drm_i915_gem_execbuffer2 { > __u64 buffers_ptr; -> must be 0 (new) > __u32 buffer_count; -> must be 0 (new) > __u32 batch_start_offset; -> must be 0 (new) > __u32 batch_len; -> must be 0 (new) > __u32 DR1; -> must be 0 (old) > __u32 DR4; -> must be 0 (old) > __u32 num_cliprects; (fences) -> must be 0 since using extensions > __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer! > __u64 flags; -> some flags must be 0 (new) > __u64 rsvd1; (context info) -> repurposed field (old) > __u64 rsvd2; -> unused >}; > >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead >of adding even more complexity to an already abused interface? While >the Vulkan-like extension thing is really nice, I don't think what >we're doing here is extending the ioctl usage, we're completely >changing how the base struct should be interpreted based on how the VM >was created (which is an entirely different ioctl). > >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is >already at -6 without these changes. I think after vm_bind we'll need >to create a -11 entry just to deal with this ioctl. >
The only change here is removing the execlist support for VM_BIND mode (other than natual extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;          /* previously execbuffer2.rsvd1 */

        __u32 batch_count;
        __u64 batch_addr_ptr;  /* Pointer to an array of batch gpu virtual addresses */
Quick question raised on IRC about the batches: Are multiple batches limited to virtual engines?
Parallel engines, see i915_context_engines_parallel_submit in i915_drm.h.
Currently the media UMD uses this uAPI to do split-frame decoding (e.g. run multiple batches in parallel on the video engines to decode an 8k frame).
Of course there could be future users of this uAPI too.
Matt
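For reference, a sketch of how such a parallel engine is described with the existing i915_context_engines_parallel_submit extension (the class:instance pairs and the map slot are placeholders); it chains off the engine map at context creation, after which a single submission carries 'width' batches:

        /* Sketch: a 2-wide parallel engine on two video (BSD) engine instances,
         * placed in slot 0 of the context engine map. */
        I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(parallel, 2) = {
                .base = { .name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT },
                .engine_index = 0,      /* engine map slot occupied by the parallel engine */
                .width = 2,             /* number of batches submitted together */
                .num_siblings = 1,      /* physical engines per logical slot */
                .engines = {
                        { I915_ENGINE_CLASS_VIDEO, 0 },
                        { I915_ENGINE_CLASS_VIDEO, 1 },
                },
        };
        /* 'parallel' is then chained into i915_context_param_engines.extensions
         * when creating the context with I915_CONTEXT_PARAM_ENGINES. */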
Thanks,
-Lionel
__u64 flags; #define I915_EXEC3_RING_MASK (0x3f) #define I915_EXEC3_DEFAULT (0<<0) #define I915_EXEC3_RENDER (1<<0) #define I915_EXEC3_BSD (2<<0) #define I915_EXEC3_BLT (3<<0) #define I915_EXEC3_VEBOX (4<<0)
#define I915_EXEC3_SECURE (1<<6) #define I915_EXEC3_IS_PINNED (1<<7)
#define I915_EXEC3_BSD_SHIFT (8) #define I915_EXEC3_BSD_MASK (3 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_DEFAULT (0 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING1 (1 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING2 (2 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_FENCE_IN (1<<10) #define I915_EXEC3_FENCE_OUT (1<<11) #define I915_EXEC3_FENCE_SUBMIT (1<<12)
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
__u64 extensions; /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */ };
With this, user can pass in batch addresses and count directly, instead of as an extension (as this rfc series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Any thing else needs to be added or removed?
Niranjana
Niranjana
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0)
+/**
- DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
- Flag to declare context as long running.
- See struct drm_i915_gem_context_create_ext flags.
- Usage of dma-fence expects that they complete in reasonable amount of time.
- Compute on the other hand can be long running. Hence it is not appropriate
- for compute contexts to export request completion dma-fence to user.
- The dma-fence usage will be limited to in-kernel consumption only.
- Compute contexts need to use user/memory fence.
- So, long running contexts do not support output fences. Hence,
- I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
- I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
- to be not used.
- DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
- to long running contexts.
- */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
+/* VM_BIND related ioctls */ +#define DRM_I915_GEM_VM_BIND 0x3d +#define DRM_I915_GEM_VM_UNBIND 0x3e +#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f
+#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+/**
- struct drm_i915_gem_vm_bind - VA to object mapping to bind.
- This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
- virtual address (VA) range to the section of an object that should be bound
- in the device page table of the specified address space (VM).
- The VA range specified must be unique (ie., not currently bound) and can
- be mapped to whole object or a section of the object (partial binding).
- Multiple VA mappings can be created to the same section of the object
- (aliasing).
- */
+struct drm_i915_gem_vm_bind {
- /** @vm_id: VM (address space) id to bind */
- __u32 vm_id;
- /** @handle: Object handle */
- __u32 handle;
- /** @start: Virtual Address start to bind */
- __u64 start;
- /** @offset: Offset in object to bind */
- __u64 offset;
- /** @length: Length of mapping to bind */
- __u64 length;
Does it support, or should it, an equivalent of EXEC_OBJECT_PAD_TO_SIZE? Or if not, is userspace expected to map the remainder of the space to a dummy object? In which case, would there be any alignment/padding issues preventing the two binds from being placed next to each other?
I ask because someone from the compute side asked me about a problem with their strategy of dealing with overfetch and I suggested pad to size.
Regards,
Tvrtko
- /**
* @flags: Supported flags are,
*
* I915_GEM_VM_BIND_READONLY:
* Mapping is read-only.
*
* I915_GEM_VM_BIND_CAPTURE:
* Capture this mapping in the dump upon GPU error.
*/
- __u64 flags;
+#define I915_GEM_VM_BIND_READONLY (1 << 0) +#define I915_GEM_VM_BIND_CAPTURE (1 << 1)
- /** @extensions: 0-terminated chain of extensions for this mapping. */
- __u64 extensions;
+};
+/**
- struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
- This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
- address (VA) range that should be unbound from the device page table of the
- specified address space (VM). The specified VA range must match one of the
- mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
- completion.
- */
+struct drm_i915_gem_vm_unbind {
- /** @vm_id: VM (address space) id to bind */
- __u32 vm_id;
- /** @rsvd: Reserved for future use; must be zero. */
- __u32 rsvd;
- /** @start: Virtual Address start to unbind */
- __u64 start;
- /** @length: Length of mapping to unbind */
- __u64 length;
- /** @flags: reserved for future usage, currently MBZ */
- __u64 flags;
- /** @extensions: 0-terminated chain of extensions for this mapping. */
- __u64 extensions;
+};
+/**
- struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
- or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence to signal
- before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the returned output fence
- after the completion of binding or unbinding.
- */
+struct drm_i915_vm_bind_fence {
- /** @handle: User's handle for a drm_syncobj to wait on or signal. */
- __u32 handle;
- /**
* @flags: Supported flags are,
*
* I915_VM_BIND_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
- __u32 flags;
+#define I915_VM_BIND_FENCE_WAIT (1<<0) +#define I915_VM_BIND_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1)) +};
+/**
- struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
- and vm_unbind.
- This structure describes an array of timeline drm_syncobj and associated
- points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
- can be input or output fences (See struct drm_i915_vm_bind_fence).
- */
+struct drm_i915_vm_bind_ext_timeline_fences { +#define I915_VM_BIND_EXT_timeline_FENCES 0
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /**
* @fence_count: Number of elements in the @handles_ptr & @value_ptr
* arrays.
*/
- __u64 fence_count;
- /**
* @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
* of length @fence_count.
*/
- __u64 handles_ptr;
- /**
* @values_ptr: Pointer to an array of u64 values of length
* @fence_count.
* Values must be 0 for a binary drm_syncobj. A Value of 0 for a
* timeline drm_syncobj is invalid as it turns a drm_syncobj into a
* binary one.
*/
- __u64 values_ptr;
+};
+/**
- struct drm_i915_vm_bind_user_fence - An input or output user fence for the
- vm_bind or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence (value at
- @addr to become equal to @val) before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the output fence after
- the completion of binding or unbinding by writing @val to memory location at
- @addr
- */
+struct drm_i915_vm_bind_user_fence {
- /** @addr: User/Memory fence qword aligned process virtual address */
- __u64 addr;
- /** @val: User/Memory fence value to be written after bind completion */
- __u64 val;
- /**
* @flags: Supported flags are,
*
* I915_VM_BIND_USER_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_USER_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
- __u32 flags;
+#define I915_VM_BIND_USER_FENCE_WAIT (1<<0) +#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
- (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
+};
+/**
- struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
- and vm_unbind.
- These user fences can be input or output fences
- (See struct drm_i915_vm_bind_user_fence).
- */
+struct drm_i915_vm_bind_ext_user_fence { +#define I915_VM_BIND_EXT_USER_FENCES 1
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @fence_count: Number of elements in the @user_fence_ptr array. */
- __u64 fence_count;
- /**
* @user_fence_ptr: Pointer to an array of
* struct drm_i915_vm_bind_user_fence of length @fence_count.
*/
- __u64 user_fence_ptr;
+};
+/**
- struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
- gpu virtual addresses.
- In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
- must always be appended in the VM_BIND mode and it will be an error to
- append this extension in older non-VM_BIND mode.
- */
+struct drm_i915_gem_execbuffer_ext_batch_addresses { +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @count: Number of addresses in the addr array. */
- __u32 count;
- /** @addr: An array of batch gpu virtual addresses. */
- __u64 addr[0];
+};
+/**
- struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
- signaling extension.
- This extension allows user to attach a user fence (@addr, @value pair) to an
- execbuf to be signaled by the command streamer after the completion of first
- level batch, by writing the @value at specified @addr and triggering an
- interrupt.
- User can either poll for this user fence to signal or can also wait on it
- with i915_gem_wait_user_fence ioctl.
- This is very much useful for long running contexts where waiting on dma-fence
- by user (like i915_gem_wait ioctl) is not supported.
- */
+struct drm_i915_gem_execbuffer_ext_user_fence { +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /**
* @addr: User/Memory fence qword aligned GPU virtual address.
*
* Address has to be a valid GPU virtual address at the time of
* first level batch completion.
*/
- __u64 addr;
- /**
* @value: User/Memory fence Value to be written to above address
* after first level batch completes.
*/
- __u64 value;
- /** @rsvd: Reserved for future extensions, MBZ */
- __u64 rsvd;
+};
+/**
- struct drm_i915_gem_create_ext_vm_private - Extension to make the object
- private to the specified VM.
- See struct drm_i915_gem_create_ext.
- */
+struct drm_i915_gem_create_ext_vm_private { +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @vm_id: Id of the VM to which the object is private */
- __u32 vm_id;
+};
+/**
- struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
- User/Memory fence can be woken up either by:
- GPU context indicated by @ctx_id, or,
- Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
- @ctx_id is ignored when this flag is set.
- Wakeup condition is,
- ``((*addr & mask) op (value & mask))``
- See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
- */
+struct drm_i915_gem_wait_user_fence {
- /** @extensions: Zero-terminated chain of extensions. */
- __u64 extensions;
- /** @addr: User/Memory fence address */
- __u64 addr;
- /** @ctx_id: Id of the Context which will signal the fence. */
- __u32 ctx_id;
- /** @op: Wakeup condition operator */
- __u16 op;
+#define I915_UFENCE_WAIT_EQ 0 +#define I915_UFENCE_WAIT_NEQ 1 +#define I915_UFENCE_WAIT_GT 2 +#define I915_UFENCE_WAIT_GTE 3 +#define I915_UFENCE_WAIT_LT 4 +#define I915_UFENCE_WAIT_LTE 5 +#define I915_UFENCE_WAIT_BEFORE 6 +#define I915_UFENCE_WAIT_AFTER 7
- /**
* @flags: Supported flags are,
*
* I915_UFENCE_WAIT_SOFT:
*
* To be woken up by i915 driver async worker (not by GPU).
*
* I915_UFENCE_WAIT_ABSTIME:
*
* Wait timeout specified as absolute time.
*/
- __u16 flags;
+#define I915_UFENCE_WAIT_SOFT 0x1 +#define I915_UFENCE_WAIT_ABSTIME 0x2
- /** @value: Wakeup value */
- __u64 value;
- /** @mask: Wakeup mask */
- __u64 mask;
+#define I915_UFENCE_WAIT_U8 0xffu +#define I915_UFENCE_WAIT_U16 0xffffu +#define I915_UFENCE_WAIT_U32 0xfffffffful +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull
- /**
* @timeout: Wait timeout in nanoseconds.
*
* If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is the
* absolute time in nsec.
*/
- __s64 timeout;
+};
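As an aside, a sketch of how the proposed wait ioctl above would be exercised from userspace (RFC-only uapi from the header above; the address, context id and values are placeholders):

        /* Sketch against the proposed (RFC) uapi above: wait until the 64-bit user
         * fence at 'fence_addr' reaches at least 'val', woken by the GPU context
         * that is expected to write it. */
        __u64 fence_addr = 0;                    /* qword-aligned user fence address */
        __u64 val = 1;
        __u32 ctx_id = 0;

        struct drm_i915_gem_wait_user_fence wait = {
                .extensions = 0,
                .addr = fence_addr,
                .ctx_id = ctx_id,                /* context expected to signal it */
                .op = I915_UFENCE_WAIT_GTE,      /* (*addr & mask) >= (value & mask) */
                .flags = 0,
                .value = val,
                .mask = I915_UFENCE_WAIT_U64,
                .timeout = 1000000000,           /* 1s, relative since ABSTIME is not set */
        };
        /* ioctl(drm_fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);  RFC ioctl */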
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0)
+/**
- DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
- Flag to declare context as long running.
- See struct drm_i915_gem_context_create_ext flags.
- Usage of dma-fence expects that they complete in reasonable amount of time.
- Compute on the other hand can be long running. Hence it is not appropriate
- for compute contexts to export request completion dma-fence to user.
- The dma-fence usage will be limited to in-kernel consumption only.
- Compute contexts need to use user/memory fence.
- So, long running contexts do not support output fences. Hence,
- I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
- I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
- to be not used.
- DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
- to long running contexts.
- */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
+/* VM_BIND related ioctls */ +#define DRM_I915_GEM_VM_BIND 0x3d +#define DRM_I915_GEM_VM_UNBIND 0x3e +#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f
+#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+/**
- struct drm_i915_gem_vm_bind - VA to object mapping to bind.
- This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
- virtual address (VA) range to the section of an object that should be bound
- in the device page table of the specified address space (VM).
- The VA range specified must be unique (i.e., not currently bound) and can
- be mapped to the whole object or to a section of the object (partial binding).
- Multiple VA mappings can be created to the same section of the object
- (aliasing).
- */
+struct drm_i915_gem_vm_bind {
- /** @vm_id: VM (address space) id to bind */
- __u32 vm_id;
- /** @handle: Object handle */
- __u32 handle;
- /** @start: Virtual Address start to bind */
- __u64 start;
- /** @offset: Offset in object to bind */
- __u64 offset;
- /** @length: Length of mapping to bind */
- __u64 length;
Does it support, or should it, an equivalent of EXEC_OBJECT_PAD_TO_SIZE? Or, if not, is userspace expected to map the remainder of the space to a dummy object? In that case, would there be any alignment/padding issues preventing the two binds from being placed next to each other?
I ask because someone from the compute side asked me about a problem with their strategy of dealing with overfetch, and I suggested pad to size.
Thanks Tvrtko, I don't think we should be needing it. As VA assignment is completely pushed to userspace with VM_BIND, no padding should be necessary once the 'start' and 'size' alignment conditions are met.
I will add some documentation on the alignment requirements here. Generally, 'start' and 'size' should be 4K aligned. But, I think when we have 64K lmem page sizes (dg2 and xehpsdv), they need to be 64K aligned.
Niranjana
Regards,
Tvrtko
- /**
* @flags: Supported flags are,
*
* I915_GEM_VM_BIND_READONLY:
* Mapping is read-only.
*
* I915_GEM_VM_BIND_CAPTURE:
* Capture this mapping in the dump upon GPU error.
*/
- __u64 flags;
+#define I915_GEM_VM_BIND_READONLY (1 << 0) +#define I915_GEM_VM_BIND_CAPTURE (1 << 1)
- /** @extensions: 0-terminated chain of extensions for this mapping. */
- __u64 extensions;
+};
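A sketch of a single bind using the ioctl proposed above. vm_id, bo_handle and the VA are placeholders picked by the UMD; per the alignment discussion above, start and length are kept 64K aligned here so the example is also safe on platforms with 64K lmem pages.

struct drm_i915_gem_vm_bind bind = {
	.vm_id  = vm_id,
	.handle = bo_handle,
	.start  = 0x100000000ull,		/* VA chosen by the UMD's allocator */
	.offset = 0,				/* bind from the start of the BO */
	.length = 64 * 1024,			/* partial binding of the first 64K */
	.flags  = I915_GEM_VM_BIND_CAPTURE,	/* include in GPU error capture */
};

int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);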
+/**
- struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
- This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
- address (VA) range that should be unbound from the device page table of the
- specified address space (VM). The specified VA range must match one of the
- mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
- completion.
- */
+struct drm_i915_gem_vm_unbind {
- /** @vm_id: VM (address space) id to bind */
- __u32 vm_id;
- /** @rsvd: Reserved for future use; must be zero. */
- __u32 rsvd;
- /** @start: Virtual Address start to unbind */
- __u64 start;
- /** @length: Length of mapping to unbind */
- __u64 length;
- /** @flags: reserved for future usage, currently MBZ */
- __u64 flags;
- /** @extensions: 0-terminated chain of extensions for this mapping. */
- __u64 extensions;
+};
+/**
- struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
- or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence to signal
- before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the returned output fence
- after the completion of binding or unbinding.
- */
+struct drm_i915_vm_bind_fence {
- /** @handle: User's handle for a drm_syncobj to wait on or signal. */
- __u32 handle;
- /**
* @flags: Supported flags are,
*
* I915_VM_BIND_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
- __u32 flags;
+#define I915_VM_BIND_FENCE_WAIT (1<<0) +#define I915_VM_BIND_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1)) +};
+/**
- struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
- and vm_unbind.
- This structure describes an array of timeline drm_syncobj and associated
- points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
- can be input or output fences (See struct drm_i915_vm_bind_fence).
- */
+struct drm_i915_vm_bind_ext_timeline_fences { +#define I915_VM_BIND_EXT_timeline_FENCES 0
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /**
* @fence_count: Number of elements in the @handles_ptr & @values_ptr
* arrays.
*/
- __u64 fence_count;
- /**
* @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
* of length @fence_count.
*/
- __u64 handles_ptr;
- /**
* @values_ptr: Pointer to an array of u64 values of length
* @fence_count.
* Values must be 0 for a binary drm_syncobj. A Value of 0 for a
* timeline drm_syncobj is invalid as it turns a drm_syncobj into a
* binary one.
*/
- __u64 values_ptr;
+};
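For a non-compute context, a completion fence for the bind sketched earlier could be requested through this extension roughly as follows; syncobj_handle and the timeline point are placeholders, and the extension name is spelled exactly as in this RFC.

struct drm_i915_vm_bind_fence fence = {
	.handle = syncobj_handle,
	.flags  = I915_VM_BIND_FENCE_SIGNAL,	/* signal on bind completion */
};
__u64 point = 1;				/* timeline point to signal */

struct drm_i915_vm_bind_ext_timeline_fences timeline = {
	.base.name   = I915_VM_BIND_EXT_timeline_FENCES,
	.fence_count = 1,
	.handles_ptr = (__u64)(uintptr_t)&fence,
	.values_ptr  = (__u64)(uintptr_t)&point,
};

bind.extensions = (__u64)(uintptr_t)&timeline;	/* chain onto the vm_bind above */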
+/**
- struct drm_i915_vm_bind_user_fence - An input or output user fence for the
- vm_bind or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence (value at
- @addr to become equal to @val) before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the output fence after
- the completion of binding or unbinding by writing @val to memory location at
- @addr.
- */
+struct drm_i915_vm_bind_user_fence {
- /** @addr: User/Memory fence qword aligned process virtual address */
- __u64 addr;
- /** @val: User/Memory fence value to be written after bind completion */
- __u64 val;
- /**
* @flags: Supported flags are,
*
* I915_VM_BIND_USER_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_USER_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
- __u32 flags;
+#define I915_VM_BIND_USER_FENCE_WAIT (1<<0) +#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
- (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
+};
+/**
- struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
- and vm_unbind.
- These user fences can be input or output fences
- (See struct drm_i915_vm_bind_user_fence).
- */
+struct drm_i915_vm_bind_ext_user_fence { +#define I915_VM_BIND_EXT_USER_FENCES 1
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @fence_count: Number of elements in the @user_fence_ptr array. */
- __u64 fence_count;
- /**
* @user_fence_ptr: Pointer to an array of
* struct drm_i915_vm_bind_user_fence of length @fence_count.
*/
- __u64 user_fence_ptr;
+};
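Long running (compute) contexts would use the same chaining but with a user/memory fence instead of a syncobj; in this sketch fence_word is an illustrative qword in process memory that the UMD later polls or waits on.

uint64_t fence_word = 0;	/* written by the async worker on completion */

struct drm_i915_vm_bind_user_fence ufence = {
	.addr  = (__u64)(uintptr_t)&fence_word,
	.val   = 1,
	.flags = I915_VM_BIND_USER_FENCE_SIGNAL,
};

struct drm_i915_vm_bind_ext_user_fence ext = {
	.base.name      = I915_VM_BIND_EXT_USER_FENCES,
	.fence_count    = 1,
	.user_fence_ptr = (__u64)(uintptr_t)&ufence,
};

bind.extensions = (__u64)(uintptr_t)&ext;	/* chain onto the vm_bind above */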
+/**
- struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
- gpu virtual addresses.
- In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
- must always be appended in the VM_BIND mode and it will be an error to
- append this extension in older non-VM_BIND mode.
- */
+struct drm_i915_gem_execbuffer_ext_batch_addresses { +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @count: Number of addresses in the addr array. */
- __u32 count;
- /** @addr: An array of batch gpu virtual addresses. */
- __u64 addr[0];
+};
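A rough sketch of an execbuff submission in VM_BIND mode follows. It relies on the existing I915_EXEC_USE_EXTENSIONS mechanism, where cliprects_ptr is reused to carry the extension chain, together with the extension proposed here; ctx_id and batch_gpu_va are placeholders and engine selection flags are omitted.

struct drm_i915_gem_execbuffer_ext_batch_addresses *ext =
	calloc(1, sizeof(*ext) + sizeof(__u64));

ext->base.name = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES;
ext->count     = 1;
ext->addr[0]   = batch_gpu_va;			/* VA previously bound with VM_BIND */

struct drm_i915_gem_execbuffer2 execbuf = {
	.buffer_count  = 0,			/* no execlist in VM_BIND mode */
	.flags         = I915_EXEC_USE_EXTENSIONS,
	.cliprects_ptr = (__u64)(uintptr_t)ext,	/* extension chain */
	.rsvd1         = ctx_id,		/* context id */
};

int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);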
+/**
- struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
- signaling extension.
- This extension allows the user to attach a user fence (@addr, @value pair) to an
- execbuf, to be signaled by the command streamer after the completion of the first
- level batch, by writing the @value at the specified @addr and triggering an
- interrupt.
- The user can either poll for this user fence to signal or wait on it
- with the i915_gem_wait_user_fence ioctl.
- This is particularly useful for long running contexts, where waiting on a dma-fence
- by the user (like the i915_gem_wait ioctl) is not supported.
- */
+struct drm_i915_gem_execbuffer_ext_user_fence { +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /**
* @addr: User/Memory fence qword aligned GPU virtual address.
*
* Address has to be a valid GPU virtual address at the time of
* first level batch completion.
*/
- __u64 addr;
- /**
* @value: User/Memory fence Value to be written to above address
* after first level batch completes.
*/
- __u64 value;
- /** @rsvd: Reserved for future extensions, MBZ */
- __u64 rsvd;
+};
+/**
- struct drm_i915_gem_create_ext_vm_private - Extension to make the object
- private to the specified VM.
- See struct drm_i915_gem_create_ext.
- */
+struct drm_i915_gem_create_ext_vm_private { +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @vm_id: Id of the VM to which the object is private */
- __u32 vm_id;
+};
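Pairing this with the existing GEM_CREATE_EXT ioctl, creating an object that is private to one VM would look roughly like the sketch below (the extension number 2 is only what this RFC proposes).

struct drm_i915_gem_create_ext_vm_private priv = {
	.base.name = I915_GEM_CREATE_EXT_VM_PRIVATE,
	.vm_id     = vm_id,
};

struct drm_i915_gem_create_ext create = {
	.size       = 64 * 1024,
	.extensions = (__u64)(uintptr_t)&priv,
};

/* On success, create.handle can only be mapped into vm_id. */
int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);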
+/**
- struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
- A User/Memory fence can be woken up either by:
- the GPU context indicated by @ctx_id, or,
- the kernel driver async worker upon I915_UFENCE_WAIT_SOFT
- (@ctx_id is ignored when this flag is set).
- The wakeup condition is,
- ``((*addr & mask) op (value & mask))``
- See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
- */
+struct drm_i915_gem_wait_user_fence {
- /** @extensions: Zero-terminated chain of extensions. */
- __u64 extensions;
- /** @addr: User/Memory fence address */
- __u64 addr;
- /** @ctx_id: Id of the Context which will signal the fence. */
- __u32 ctx_id;
- /** @op: Wakeup condition operator */
- __u16 op;
+#define I915_UFENCE_WAIT_EQ 0 +#define I915_UFENCE_WAIT_NEQ 1 +#define I915_UFENCE_WAIT_GT 2 +#define I915_UFENCE_WAIT_GTE 3 +#define I915_UFENCE_WAIT_LT 4 +#define I915_UFENCE_WAIT_LTE 5 +#define I915_UFENCE_WAIT_BEFORE 6 +#define I915_UFENCE_WAIT_AFTER 7
- /**
* @flags: Supported flags are,
*
* I915_UFENCE_WAIT_SOFT:
*
* To be woken up by i915 driver async worker (not by GPU).
*
* I915_UFENCE_WAIT_ABSTIME:
*
* Wait timeout specified as absolute time.
*/
- __u16 flags;
+#define I915_UFENCE_WAIT_SOFT 0x1 +#define I915_UFENCE_WAIT_ABSTIME 0x2
- /** @value: Wakeup value */
- __u64 value;
- /** @mask: Wakeup mask */
- __u64 mask;
+#define I915_UFENCE_WAIT_U8 0xffu +#define I915_UFENCE_WAIT_U16 0xffffu +#define I915_UFENCE_WAIT_U32 0xfffffffful +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull
- /**
* @timeout: Wait timeout in nanoseconds.
*
* If the I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is the
* absolute time in nsec.
*/
- __s64 timeout;
+};
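Tying the pieces together, a compute UMD could wait for a user fence signaled by the GPU roughly as below; fence_gpu_va, ctx_id, expected_value and the 1 ms timeout are placeholders, and the wakeup condition follows the ``((*addr & mask) op (value & mask))`` form documented above.

struct drm_i915_gem_wait_user_fence wait = {
	.addr    = fence_gpu_va,	/* qword-aligned fence location */
	.ctx_id  = ctx_id,		/* context expected to signal it */
	.op      = I915_UFENCE_WAIT_GTE,
	.value   = expected_value,
	.mask    = I915_UFENCE_WAIT_U64,
	.timeout = 1000000,		/* 1 ms, relative (ABSTIME not set) */
};

int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);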
On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
Thanks Tvrtko, I think we shouldn't be needing it. As with VM_BIND VA assignment is completely pushed to userspace, no padding should be necessary once the 'start' and 'size' alignment conditions are met.
I will add some documentation on alignment requirement here. Generally, 'start' and 'size' should be 4K aligned. But, I think when we have 64K lmem page sizes (dg2 and xehpsdv), they need to be 64K aligned.
+ Matt
Is aligning to 64k enough for all overfetch issues?
Apparently compute has a situation where a buffer is received by one component and another has to apply more alignment to it, to deal with overfetch. Since they cannot grow the actual BO, would they want to VM_BIND a scratch area on top? Or perhaps none of this is a problem on discrete and the original BO should be correctly allocated to start with.
Side question - what about the align to 2MiB mentioned in i915_vma_insert to avoid mixing 4k and 64k PTEs? Does that not apply to discrete?
Regards,
Tvrtko
On 08/06/2022 08:17, Tvrtko Ursulin wrote:
On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
Not sure about the overfetch thing, but yeah dg2 & xehpsdv both require a minimum of 64K pages underneath for local memory, and the BO size will also be rounded up accordingly. And yeah the complication arises due to not being able to mix 4K + 64K GTT pages within the same page-table (existed since even gen8). Note that 4K here is what we typically get for system memory.
Originally we had a memory coloring scheme to track the "color" of each page-table, which basically ensures that userspace can't do something nasty like mixing page sizes. The advantage of that scheme is that we would only require 64K GTT alignment and no extra padding, but it is perhaps a little complex.
The merged solution is just to align and pad the vma (i.e. vma->node.size and not vma->size) out to 2M, which is dead simple implementation-wise, but does potentially waste some GTT space and some of the local memory used for the actual page-table. For the alignment the kernel just validates that the GTT address is aligned to 2M in vma_insert(), and for the padding it just inflates it to 2M, if userspace hasn't already.
See the kernel-doc for @size: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_cr...
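To make the consequence for VM_BIND concrete, a userspace VA allocator on these 64K lmem page platforms could simply round its lmem binds to 2M itself; the helpers below are only illustrative (names made up for the example).

#define SZ_2M	(2ull << 20)

static __u64 align_up(__u64 x, __u64 a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Keep 64K (lmem) and 4K (smem) PTEs out of the same page table by
 * handing out lmem VA in whole 2M chunks. */
__u64 start  = align_up(va_hint, SZ_2M);
__u64 length = align_up(bo_size, SZ_2M);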
On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
On 08/06/2022 08:17, Tvrtko Ursulin wrote:
On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
Ok, those requirements (2M VA alignment) will apply to VM_BIND also. This is unfortunate, but it is not something newly enforced by VM_BIND. The other option is to go with 64K alignment, and in the VM_BIND case the user must ensure there is no mixing of 64K (lmem) and 4K (smem) mappings in the same 2M range. But this is not VM_BIND specific (it will apply to soft-pinning in execbuf2 also).
I don't think we need any VA padding here, as with VM_BIND the VA is managed fully by the user. If we enforce the VA to be 2M aligned, it will leave holes (if BOs are smaller than 2M), but nobody is going to allocate anything from there.
Niranjana
On 08/06/2022 22:32, Niranjana Vishwanathapura wrote:
On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
On 08/06/2022 08:17, Tvrtko Ursulin wrote:
On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura
niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+/**
+ * DOC: I915_PARAM_HAS_VM_BIND
+ *
+ * VM_BIND feature availability.
+ * See typedef drm_i915_getparam_t param.
+ */
+#define I915_PARAM_HAS_VM_BIND 57
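For illustration only (not part of the proposed header): userspace feature discovery would presumably boil down to something like the sketch below. It assumes libdrm's drmIoctl(), that the new defines end up in i915_drm.h, and a made-up helper name; the same headers are assumed by the other sketches further down.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <xf86drm.h>        /* drmIoctl() from libdrm */
#include <drm/i915_drm.h>   /* existing uapi plus, assumed, the defines above */

/* Hypothetical helper: query VM_BIND availability via GETPARAM. */
static bool has_vm_bind(int fd)
{
	int value = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_VM_BIND,
		.value = &value,
	};

	return drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) == 0 && value != 0;
}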
+/**
+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
+ *
+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
+ * See struct drm_i915_gem_vm_control flags.
+ *
+ * A VM in VM_BIND mode will not support the older execbuff mode of binding.
+ * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
+ * &drm_i915_gem_execbuffer2.buffer_count must be 0).
+ * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
+ * &drm_i915_gem_execbuffer2.batch_len must be 0.
+ * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
+ * to pass in the batch buffer addresses.
+ *
+ * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
+ * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
+ * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
+ * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+ * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
+ * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
+ */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0)
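As an aside (a sketch, not part of the patch): opting a VM in to VM_BIND mode would then look roughly like this, using the existing VM create ioctl and struct drm_i915_gem_vm_control; the helper name is invented.

/* Hypothetical helper: create an address space that opts in to VM_BIND. */
static int create_vm_bind_vm(int fd, __u32 *vm_id)
{
	struct drm_i915_gem_vm_control ctl = {
		.flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,
	};
	int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &ctl);

	if (ret == 0)
		*vm_id = ctl.vm_id;
	return ret;
}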
+/**
+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
+ *
+ * Flag to declare context as long running.
+ * See struct drm_i915_gem_context_create_ext flags.
+ *
+ * Usage of dma-fence expects that they complete in reasonable amount of time.
+ * Compute on the other hand can be long running. Hence it is not appropriate
+ * for compute contexts to export request completion dma-fence to user.
+ * The dma-fence usage will be limited to in-kernel consumption only.
+ * Compute contexts need to use user/memory fence.
+ *
+ * So, long running contexts do not support output fences. Hence,
+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
+ * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
+ * to be not used.
+ *
+ * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
+ * to long running contexts.
+ */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
+/* VM_BIND related ioctls */ +#define DRM_I915_GEM_VM_BIND 0x3d +#define DRM_I915_GEM_VM_UNBIND 0x3e +#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f
+#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+/**
+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
+ *
+ * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
+ * virtual address (VA) range to the section of an object that should be bound
+ * in the device page table of the specified address space (VM).
+ * The VA range specified must be unique (ie., not currently bound) and can
+ * be mapped to whole object or a section of the object (partial binding).
+ * Multiple VA mappings can be created to the same section of the object
+ * (aliasing).
+ */
+struct drm_i915_gem_vm_bind { + /** @vm_id: VM (address space) id to bind */ + __u32 vm_id;
+ /** @handle: Object handle */ + __u32 handle;
+ /** @start: Virtual Address start to bind */ + __u64 start;
+ /** @offset: Offset in object to bind */ + __u64 offset;
+ /** @length: Length of mapping to bind */ + __u64 length;
Does it support, or should it, equivalent of EXEC_OBJECT_PAD_TO_SIZE? Or if not userspace is expected to map the remainder of the space to a dummy object? In which case would there be any alignment/padding issues preventing the two bind to be placed next to each other?
I ask because someone from the compute side asked me about a problem with their strategy of dealing with overfetch and I suggested pad to size.
Thanks Tvrtko, I think we shouldn't be needing it. As with VM_BIND VA assignment is completely pushed to userspace, no padding should be necessary once the 'start' and 'size' alignment conditions are met.
I will add some documentation on alignment requirement here. Generally, 'start' and 'size' should be 4K aligned. But, I think when we have 64K lmem page sizes (dg2 and xehpsdv), they need to be 64K aligned.
- Matt
Align to 64k is enough for all overfetch issues?
Apparently compute has a situation where a buffer is received by one component and another has to apply more alignment to it, to deal with overfetch. Since they cannot grow the actual BO if they wanted to VM_BIND a scratch area on top? Or perhaps none of this is a problem on discrete and original BO should be correctly allocated to start with.
Side question - what about the align to 2MiB mentioned in i915_vma_insert to avoid mixing 4k and 64k PTEs? That does not apply to discrete?
Not sure about the overfetch thing, but yeah dg2 & xehpsdv both require a minimum of 64K pages underneath for local memory, and the BO size will also be rounded up accordingly. And yeah the complication arises due to not being able to mix 4K + 64K GTT pages within the same page-table (existed since even gen8). Note that 4K here is what we typically get for system memory.
Originally we had a memory coloring scheme to track the "color" of each page-table, which basically ensures that userspace can't do something nasty like mixing page sizes. The advantage of that scheme is that we would only require 64K GTT alignment and no extra padding, but is perhaps a little complex.
The merged solution is just to align and pad (i.e vma->node.size and not vma->size) out of the vma to 2M, which is dead simple implementation wise, but does potentially waste some GTT space and some of the local memory used for the actual page-table. For the alignment the kernel just validates that the GTT address is aligned to 2M in vma_insert(), and then for the padding it just inflates it to 2M, if userspace hasn't already.
See the kernel-doc for @size: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_cr...
Ok, those requirements (2M VA alignment) will apply to VM_BIND also. This is unfortunate, but it is not something new enforced by VM_BIND. Other option is to go with 64K alignment and in VM_BIND case, user must ensure there is no mix-matching of 64K (lmem) and 4k (smem) mappings in the same 2M range. But this is not VM_BIND specific (will apply to soft-pinning in execbuf2 also).
I don't think we need any VA padding here as, with VM_BIND, VA is managed fully by the user. If we enforce VA to be 2M aligned, it will leave holes (if BOs are smaller than 2M), but nobody is going to allocate anything from there.
Note that we only apply the 2M alignment + padding for local memory pages, for system memory we don't have/need such restrictions. The VA padding then importantly prevents userspace from incorrectly (or maliciously) inserting 4K system memory object in some page-table operating in 64K GTT mode.
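To make that rule concrete (a sketch only; the 2M figure is the LMEM alignment discussed above, and the helper name is made up), a userspace VA allocator for LMEM-backed binds would effectively do:

/* Hypothetical helper: round a VA or length up to the LMEM bind granularity. */
#define LMEM_BIND_ALIGNMENT	(2ull << 20)	/* 2M, per the discussion above */

static __u64 lmem_bind_align(__u64 x)
{
	return (x + LMEM_BIND_ALIGNMENT - 1) & ~(LMEM_BIND_ALIGNMENT - 1);
}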
Niranjana
Regards,
Tvrtko
Niranjana
Regards,
Tvrtko
+ /** + * @flags: Supported flags are, + * + * I915_GEM_VM_BIND_READONLY: + * Mapping is read-only. + * + * I915_GEM_VM_BIND_CAPTURE: + * Capture this mapping in the dump upon GPU error. + */ + __u64 flags; +#define I915_GEM_VM_BIND_READONLY (1 << 0) +#define I915_GEM_VM_BIND_CAPTURE (1 << 1)
+ /** @extensions: 0-terminated chain of extensions for this mapping. */ + __u64 extensions; +};
+/**
+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
+ *
+ * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
+ * address (VA) range that should be unbound from the device page table of the
+ * specified address space (VM). The specified VA range must match one of the
+ * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
+ * completion.
+ */
+struct drm_i915_gem_vm_unbind { + /** @vm_id: VM (address space) id to bind */ + __u32 vm_id;
+ /** @rsvd: Reserved for future use; must be zero. */ + __u32 rsvd;
+ /** @start: Virtual Address start to unbind */ + __u64 start;
+ /** @length: Length of mapping to unbind */ + __u64 length;
+ /** @flags: reserved for future usage, currently MBZ */ + __u64 flags;
+ /** @extensions: 0-terminated chain of extensions for this mapping. */ + __u64 extensions; +};
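For illustration (not part of the proposed header): with the two structs above, binding a whole BO at a user-chosen VA and later unbinding it would look roughly like the following sketch; the helper names and the choice of flags are assumptions.

/* Hypothetical helpers: bind a whole BO at a user-chosen VA, later unbind it. */
static int vm_bind_bo(int fd, __u32 vm_id, __u32 handle, __u64 va, __u64 size)
{
	struct drm_i915_gem_vm_bind bind = {
		.vm_id = vm_id,
		.handle = handle,
		.start = va,
		.offset = 0,				/* whole object */
		.length = size,
		.flags = I915_GEM_VM_BIND_CAPTURE,	/* include in error dumps */
	};

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);
}

static int vm_unbind_bo(int fd, __u32 vm_id, __u64 va, __u64 size)
{
	struct drm_i915_gem_vm_unbind unbind = {
		.vm_id = vm_id,
		.start = va,
		.length = size,
	};

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_UNBIND, &unbind);
}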
+/**
+ * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
+ * or the vm_unbind work.
+ *
+ * The vm_bind or vm_unbind async worker will wait for input fence to signal
+ * before starting the binding or unbinding.
+ *
+ * The vm_bind or vm_unbind async worker will signal the returned output fence
+ * after the completion of binding or unbinding.
+ */
+struct drm_i915_vm_bind_fence { + /** @handle: User's handle for a drm_syncobj to wait on or signal. */ + __u32 handle;
+ /** + * @flags: Supported flags are, + * + * I915_VM_BIND_FENCE_WAIT: + * Wait for the input fence before binding/unbinding + * + * I915_VM_BIND_FENCE_SIGNAL: + * Return bind/unbind completion fence as output + */ + __u32 flags; +#define I915_VM_BIND_FENCE_WAIT (1<<0) +#define I915_VM_BIND_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1)) +};
+/**
+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
+ * and vm_unbind.
+ *
+ * This structure describes an array of timeline drm_syncobj and associated
+ * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
+ * can be input or output fences (See struct drm_i915_vm_bind_fence).
+ */
+struct drm_i915_vm_bind_ext_timeline_fences { +#define I915_VM_BIND_EXT_timeline_FENCES 0 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base;
+ /** + * @fence_count: Number of elements in the @handles_ptr & @value_ptr + * arrays. + */ + __u64 fence_count;
+ /** + * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence + * of length @fence_count. + */ + __u64 handles_ptr;
+ /** + * @values_ptr: Pointer to an array of u64 values of length + * @fence_count. + * Values must be 0 for a binary drm_syncobj. A Value of 0 for a + * timeline drm_syncobj is invalid as it turns a drm_syncobj into a + * binary one. + */ + __u64 values_ptr; +};
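A rough sketch of how a non-compute client might ask for a bind-completion fence on a timeline syncobj with this extension (illustrative only; the helper is hypothetical and the extension name is spelled exactly as defined above):

/* Hypothetical helper: ask for a bind-completion fence on a timeline syncobj. */
static void bind_add_out_fence(struct drm_i915_gem_vm_bind *bind,
			       struct drm_i915_vm_bind_ext_timeline_fences *ext,
			       struct drm_i915_vm_bind_fence *fence,
			       __u64 *value, __u32 syncobj, __u64 point)
{
	fence->handle = syncobj;
	fence->flags = I915_VM_BIND_FENCE_SIGNAL;
	*value = point;			/* 0 would mean a binary syncobj */

	memset(ext, 0, sizeof(*ext));
	ext->base.name = I915_VM_BIND_EXT_timeline_FENCES;
	ext->fence_count = 1;
	ext->handles_ptr = (__u64)(uintptr_t)fence;
	ext->values_ptr = (__u64)(uintptr_t)value;

	bind->extensions = (__u64)(uintptr_t)ext;
}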
+/**
+ * struct drm_i915_vm_bind_user_fence - An input or output user fence for the
+ * vm_bind or the vm_unbind work.
+ *
+ * The vm_bind or vm_unbind async worker will wait for the input fence (value at
+ * @addr to become equal to @val) before starting the binding or unbinding.
+ *
+ * The vm_bind or vm_unbind async worker will signal the output fence after
+ * the completion of binding or unbinding by writing @val to memory location at
+ * @addr.
+ */
+struct drm_i915_vm_bind_user_fence { + /** @addr: User/Memory fence qword aligned process virtual address */ + __u64 addr;
+ /** @val: User/Memory fence value to be written after bind completion */ + __u64 val;
+ /** + * @flags: Supported flags are, + * + * I915_VM_BIND_USER_FENCE_WAIT: + * Wait for the input fence before binding/unbinding + * + * I915_VM_BIND_USER_FENCE_SIGNAL: + * Return bind/unbind completion fence as output + */ + __u32 flags; +#define I915_VM_BIND_USER_FENCE_WAIT (1<<0) +#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \ + (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1)) +};
+/**
+ * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
+ * and vm_unbind.
+ *
+ * These user fences can be input or output fences
+ * (See struct drm_i915_vm_bind_user_fence).
+ */
+struct drm_i915_vm_bind_ext_user_fence { +#define I915_VM_BIND_EXT_USER_FENCES 1 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base;
+ /** @fence_count: Number of elements in the @user_fence_ptr array. */ + __u64 fence_count;
+ /** + * @user_fence_ptr: Pointer to an array of + * struct drm_i915_vm_bind_user_fence of length @fence_count. + */ + __u64 user_fence_ptr; +};
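The compute-context equivalent of the timeline sketch above could look something like this (again illustrative only; the helper name and the single-fence setup are assumptions):

/* Hypothetical helper: request a user/memory fence write on bind completion. */
static void bind_add_user_fence(struct drm_i915_gem_vm_bind *bind,
				struct drm_i915_vm_bind_ext_user_fence *ext,
				struct drm_i915_vm_bind_user_fence *uf,
				__u64 fence_cpu_addr, __u64 value)
{
	uf->addr = fence_cpu_addr;	/* qword aligned process virtual address */
	uf->val = value;
	uf->flags = I915_VM_BIND_USER_FENCE_SIGNAL;

	memset(ext, 0, sizeof(*ext));
	ext->base.name = I915_VM_BIND_EXT_USER_FENCES;
	ext->fence_count = 1;
	ext->user_fence_ptr = (__u64)(uintptr_t)uf;

	bind->extensions = (__u64)(uintptr_t)ext;
}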
+/**
+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
+ * gpu virtual addresses.
+ *
+ * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
+ * must always be appended in the VM_BIND mode and it will be an error to
+ * append this extension in older non-VM_BIND mode.
+ */
+struct drm_i915_gem_execbuffer_ext_batch_addresses { +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base;
+ /** @count: Number of addresses in the addr array. */ + __u32 count;
+ /** @addr: An array of batch gpu virtual addresses. */ + __u64 addr[0]; +};
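To illustrate the execbuf side (a sketch, not part of the patch): with no execlist, the only inputs are the batch VA via this extension and the context; the extension chain is passed the way I915_EXEC_USE_EXTENSIONS already does it in the existing execbuffer2 uapi, i.e. through cliprects_ptr. Helper name and layout are assumptions.

/* Hypothetical helper: submit one batch by VA on a VM_BIND context. */
static int execbuf_vm_bind_mode(int fd, __u32 ctx_id, __u64 batch_va)
{
	struct {
		struct drm_i915_gem_execbuffer_ext_batch_addresses ext;
		__u64 addrs[1];		/* storage behind ext.addr[] */
	} batch;
	struct drm_i915_gem_execbuffer2 execbuf;

	memset(&batch, 0, sizeof(batch));
	batch.ext.base.name = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES;
	batch.ext.count = 1;
	batch.addrs[0] = batch_va;

	memset(&execbuf, 0, sizeof(execbuf));
	/* No execlist: buffers_ptr, buffer_count, batch_start_offset and
	 * batch_len all stay 0 in VM_BIND mode. */
	execbuf.flags = I915_EXEC_USE_EXTENSIONS;
	execbuf.cliprects_ptr = (__u64)(uintptr_t)&batch;	/* extension chain */
	execbuf.rsvd1 = ctx_id;					/* context id */

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}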
+/**
+ * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
+ * signaling extension.
+ *
+ * This extension allows user to attach a user fence (@addr, @value pair) to an
+ * execbuf to be signaled by the command streamer after the completion of first
+ * level batch, by writing the @value at specified @addr and triggering an
+ * interrupt.
+ * User can either poll for this user fence to signal or can also wait on it
+ * with i915_gem_wait_user_fence ioctl.
+ * This is very much useful for long running contexts where waiting on dma-fence
+ * by user (like i915_gem_wait ioctl) is not supported.
+ */
+struct drm_i915_gem_execbuffer_ext_user_fence { +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base;
+ /** + * @addr: User/Memory fence qword aligned GPU virtual address. + * + * Address has to be a valid GPU virtual address at the time of + * first level batch completion. + */ + __u64 addr;
+ /** + * @value: User/Memory fence Value to be written to above address + * after first level batch completes. + */ + __u64 value;
+ /** @rsvd: Reserved for future extensions, MBZ */ + __u64 rsvd; +};
+/**
+ * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
+ * private to the specified VM.
+ *
+ * See struct drm_i915_gem_create_ext.
+ */
+struct drm_i915_gem_create_ext_vm_private { +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base;
+ /** @vm_id: Id of the VM to which the object is private */ + __u32 vm_id; +};
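A minimal sketch of using this with the existing GEM_CREATE_EXT ioctl (illustrative only; helper name is made up):

/* Hypothetical helper: create a BO that is private to a single VM. */
static int create_vm_private_bo(int fd, __u32 vm_id, __u64 size, __u32 *handle)
{
	struct drm_i915_gem_create_ext_vm_private priv = {
		.base.name = I915_GEM_CREATE_EXT_VM_PRIVATE,
		.vm_id = vm_id,
	};
	struct drm_i915_gem_create_ext create = {
		.size = size,
		.extensions = (__u64)(uintptr_t)&priv,
	};
	int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);

	if (ret == 0)
		*handle = create.handle;
	return ret;
}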
+/**
+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
+ *
+ * User/Memory fence can be woken up either by:
+ *
+ * 1. GPU context indicated by @ctx_id, or,
+ * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
+ *    @ctx_id is ignored when this flag is set.
+ *
+ * Wakeup condition is,
+ * ``((*addr & mask) op (value & mask))``
+ *
+ * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
+ */
+struct drm_i915_gem_wait_user_fence { + /** @extensions: Zero-terminated chain of extensions. */ + __u64 extensions;
+ /** @addr: User/Memory fence address */ + __u64 addr;
+ /** @ctx_id: Id of the Context which will signal the fence. */ + __u32 ctx_id;
+ /** @op: Wakeup condition operator */ + __u16 op; +#define I915_UFENCE_WAIT_EQ 0 +#define I915_UFENCE_WAIT_NEQ 1 +#define I915_UFENCE_WAIT_GT 2 +#define I915_UFENCE_WAIT_GTE 3 +#define I915_UFENCE_WAIT_LT 4 +#define I915_UFENCE_WAIT_LTE 5 +#define I915_UFENCE_WAIT_BEFORE 6 +#define I915_UFENCE_WAIT_AFTER 7
+ /** + * @flags: Supported flags are, + * + * I915_UFENCE_WAIT_SOFT: + * + * To be woken up by i915 driver async worker (not by GPU). + * + * I915_UFENCE_WAIT_ABSTIME: + * + * Wait timeout specified as absolute time. + */ + __u16 flags; +#define I915_UFENCE_WAIT_SOFT 0x1 +#define I915_UFENCE_WAIT_ABSTIME 0x2
+ /** @value: Wakeup value */ + __u64 value;
+ /** @mask: Wakeup mask */ + __u64 mask; +#define I915_UFENCE_WAIT_U8 0xffu +#define I915_UFENCE_WAIT_U16 0xffffu +#define I915_UFENCE_WAIT_U32 0xfffffffful +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull
+ /** + * @timeout: Wait timeout in nanoseconds. + * + * If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is the + * absolute time in nsec. + */ + __s64 timeout; +};
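To close the loop on the compute flow (a sketch, not part of the patch): after submitting with the user fence execbuf extension above, a client could either poll the fence location or block in the kernel roughly like this; the helper name and the choice of a 64-bit GTE wait are assumptions.

/* Hypothetical helper: block until the 64-bit value at fence_addr >= value. */
static int wait_user_fence_gte(int fd, __u32 ctx_id, __u64 fence_addr,
			       __u64 value, __s64 timeout_ns)
{
	struct drm_i915_gem_wait_user_fence wait = {
		.addr = fence_addr,
		.ctx_id = ctx_id,
		.op = I915_UFENCE_WAIT_GTE,
		.value = value,
		.mask = I915_UFENCE_WAIT_U64,
		.timeout = timeout_ns,	/* relative; I915_UFENCE_WAIT_ABSTIME not set */
	};

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);
}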
On Thu, Jun 09, 2022 at 09:36:48AM +0100, Matthew Auld wrote:
On 08/06/2022 22:32, Niranjana Vishwanathapura wrote:
On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
On 08/06/2022 08:17, Tvrtko Ursulin wrote:
On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
[snip]
Does it support, or should it, equivalent of EXEC_OBJECT_PAD_TO_SIZE? Or if not userspace is expected to map the remainder of the space to a dummy object? In which case would there be any alignment/padding issues preventing the two bind to be placed next to each other?
I ask because someone from the compute side asked me about a problem with their strategy of dealing with overfetch and I suggested pad to size.
Thanks Tvrtko, I think we shouldn't be needing it. As with VM_BIND VA assignment is completely pushed to userspace, no padding should be necessary once the 'start' and 'size' alignment conditions are met.
I will add some documentation on alignment requirement here. Generally, 'start' and 'size' should be 4K aligned. But, I think when we have 64K lmem page sizes (dg2 and xehpsdv), they need to be 64K aligned.
- Matt
Align to 64k is enough for all overfetch issues?
Apparently compute has a situation where a buffer is received by one component and another has to apply more alignment to it, to deal with overfetch. Since they cannot grow the actual BO if they wanted to VM_BIND a scratch area on top? Or perhaps none of this is a problem on discrete and original BO should be correctly allocated to start with.
Side question - what about the align to 2MiB mentioned in i915_vma_insert to avoid mixing 4k and 64k PTEs? That does not apply to discrete?
Not sure about the overfetch thing, but yeah dg2 & xehpsdv both require a minimum of 64K pages underneath for local memory, and the BO size will also be rounded up accordingly. And yeah the complication arises due to not being able to mix 4K + 64K GTT pages within the same page-table (existed since even gen8). Note that 4K here is what we typically get for system memory.
Originally we had a memory coloring scheme to track the "color" of each page-table, which basically ensures that userspace can't do something nasty like mixing page sizes. The advantage of that scheme is that we would only require 64K GTT alignment and no extra padding, but is perhaps a little complex.
The merged solution is just to align and pad (i.e vma->node.size and not vma->size) out of the vma to 2M, which is dead simple implementation wise, but does potentially waste some GTT space and some of the local memory used for the actual page-table. For the alignment the kernel just validates that the GTT address is aligned to 2M in vma_insert(), and then for the padding it just inflates it to 2M, if userspace hasn't already.
See the kernel-doc for @size: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_cr...
Ok, those requirements (2M VA alignment) will apply to VM_BIND also. This is unfortunate, but it is not something new enforced by VM_BIND. Other option is to go with 64K alignment and in VM_BIND case, user must ensure there is no mix-matching of 64K (lmem) and 4k (smem) mappings in the same 2M range. But this is not VM_BIND specific (will apply to soft-pinning in execbuf2 also).
I don't think we need any VA padding here as, with VM_BIND, VA is managed fully by the user. If we enforce VA to be 2M aligned, it will leave holes (if BOs are smaller than 2M), but nobody is going to allocate anything from there.
Note that we only apply the 2M alignment + padding for local memory pages, for system memory we don't have/need such restrictions. The VA padding then importantly prevents userspace from incorrectly (or maliciously) inserting 4K system memory object in some page-table operating in 64K GTT mode.
Thanks Matt. I also synced offline with Matt a bit on this. We don't need an explicit 'pad_to_size'. The i915 driver is implicitly padding the size to a 2M boundary for LMEM BOs, which will apply to VM_BIND also. The remaining question is whether we enforce 2M VA alignment for lmem BOs (just like the legacy execbuff path) on dg2 & xehpsdv, or go with just 64K alignment but ensure there is no mixing of 4K and 64K mappings in the same 2M range. I think we can go with the 2M alignment requirement for VM_BIND also. So, no new requirements here for VM_BIND.
I will update the documentation.
Niranjana
Niranjana
Regards,
Tvrtko
Niranjana
Regards,
Tvrtko
[snip]
On 09/06/2022 19:53, Niranjana Vishwanathapura wrote:
On Thu, Jun 09, 2022 at 09:36:48AM +0100, Matthew Auld wrote:
On 08/06/2022 22:32, Niranjana Vishwanathapura wrote:
On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
On 08/06/2022 08:17, Tvrtko Ursulin wrote:
On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
[snip]
Does it support, or should it, equivalent of EXEC_OBJECT_PAD_TO_SIZE? Or if not userspace is expected to map the remainder of the space to a dummy object? In which case would there be any alignment/padding issues preventing the two bind to be placed next to each other?
I ask because someone from the compute side asked me about a problem with their strategy of dealing with overfetch and I suggested pad to size.
Thanks Tvrtko, I think we shouldn't be needing it. As with VM_BIND VA assignment is completely pushed to userspace, no padding should be necessary once the 'start' and 'size' alignment conditions are met.
I will add some documentation on alignment requirement here. Generally, 'start' and 'size' should be 4K aligned. But, I think when we have 64K lmem page sizes (dg2 and xehpsdv), they need to be 64K aligned.
- Matt
Align to 64k is enough for all overfetch issues?
Apparently compute has a situation where a buffer is received by one component and another has to apply more alignment to it, to deal with overfetch. Since they cannot grow the actual BO if they wanted to VM_BIND a scratch area on top? Or perhaps none of this is a problem on discrete and original BO should be correctly allocated to start with.
Side question - what about the align to 2MiB mentioned in i915_vma_insert to avoid mixing 4k and 64k PTEs? That does not apply to discrete?
Not sure about the overfetch thing, but yeah dg2 & xehpsdv both require a minimum of 64K pages underneath for local memory, and the BO size will also be rounded up accordingly. And yeah the complication arises due to not being able to mix 4K + 64K GTT pages within the same page-table (existed since even gen8). Note that 4K here is what we typically get for system memory.
Originally we had a memory coloring scheme to track the "color" of each page-table, which basically ensures that userspace can't do something nasty like mixing page sizes. The advantage of that scheme is that we would only require 64K GTT alignment and no extra padding, but is perhaps a little complex.
The merged solution is just to align and pad (i.e vma->node.size and not vma->size) out of the vma to 2M, which is dead simple implementation wise, but does potentially waste some GTT space and some of the local memory used for the actual page-table. For the alignment the kernel just validates that the GTT address is aligned to 2M in vma_insert(), and then for the padding it just inflates it to 2M, if userspace hasn't already.
See the kernel-doc for @size: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_cr...
Ok, those requirements (2M VA alignment) will apply to VM_BIND also. This is unfortunate, but it is not something new enforced by VM_BIND. Other option is to go with 64K alignment and in VM_BIND case, user must ensure there is no mix-matching of 64K (lmem) and 4k (smem) mappings in the same 2M range. But this is not VM_BIND specific (will apply to soft-pinning in execbuf2 also).
I don't think we need any VA padding here as, with VM_BIND, VA is managed fully by the user. If we enforce VA to be 2M aligned, it will leave holes (if BOs are smaller than 2M), but nobody is going to allocate anything from there.
Note that we only apply the 2M alignment + padding for local memory pages, for system memory we don't have/need such restrictions. The VA padding then importantly prevents userspace from incorrectly (or maliciously) inserting 4K system memory object in some page-table operating in 64K GTT mode.
Thanks Matt. I also synced offline with Matt a bit on this. We don't need an explicit 'pad_to_size'. The i915 driver is implicitly padding the size to a 2M boundary for LMEM BOs, which will apply to VM_BIND also. The remaining question is whether we enforce 2M VA alignment for lmem BOs (just like the legacy execbuff path) on dg2 & xehpsdv, or go with just 64K alignment but ensure there is no mixing of 4K and 64K
"Driver is implicitly padding the size to 2MB boundary" - this is the backing store?
mappings in the same 2M range. I think we can go with the 2M alignment requirement for VM_BIND also. So, no new requirements here for VM_BIND.
Are there any considerations here of letting userspace know? Presumably the userspace allocator has to know, or it would try to ask for impossible addresses.
Regards,
Tvrtko
I will update the documentation.
Niranjana
Niranjana
Regards,
Tvrtko
Niranjana
[snip]
On 10/06/2022 11:16, Tvrtko Ursulin wrote:
On 09/06/2022 19:53, Niranjana Vishwanathapura wrote:
On Thu, Jun 09, 2022 at 09:36:48AM +0100, Matthew Auld wrote:
On 08/06/2022 22:32, Niranjana Vishwanathapura wrote:
On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
On 08/06/2022 08:17, Tvrtko Ursulin wrote:
On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
[snip]
Thanks Tvrtko, I think we shouldn't be needing it. As with VM_BIND VA assignment is completely pushed to userspace, no padding should be necessary once the 'start' and 'size' alignment conditions are met.
I will add some documentation on alignment requirement here. Generally, 'start' and 'size' should be 4K aligned. But, I think when we have 64K lmem page sizes (dg2 and xehpsdv), they need to be 64K aligned.
- Matt
Is aligning to 64K enough for all overfetch issues?
Apparently compute has a situation where a buffer is received by one component and another has to apply extra alignment to it to deal with overfetch. Since they cannot grow the actual BO, would they VM_BIND a scratch area on top instead? Or perhaps none of this is a problem on discrete and the original BO should be correctly allocated to start with.
Side question - what about the alignment to 2MiB mentioned in i915_vma_insert to avoid mixing 4K and 64K PTEs? Does that not apply to discrete?
Not sure about the overfetch thing, but yeah dg2 & xehpsdv both require a minimum of 64K pages underneath for local memory, and the BO size will also be rounded up accordingly. And yeah the complication arises from not being able to mix 4K + 64K GTT pages within the same page-table (a restriction that has existed since gen8). Note that 4K here is what we typically get for system memory.
Originally we had a memory coloring scheme to track the "color" of each page-table, which basically ensures that userspace can't do something nasty like mixing page sizes. The advantage of that scheme is that we would only require 64K GTT alignment and no extra padding, but it is perhaps a little complex.
The merged solution is just to align and pad the vma out to 2M (i.e. vma->node.size and not vma->size), which is dead simple implementation-wise, but does potentially waste some GTT space and some of the local memory used for the actual page-table. For the alignment the kernel just validates that the GTT address is aligned to 2M in vma_insert(), and for the padding it just inflates the node to 2M if userspace hasn't already.
See the kernel-doc for @size: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_cr...
Ok, those requirements (2M VA alignment) will apply to VM_BIND also. This is unfortunate, but it is not something newly enforced by VM_BIND. The other option is to go with 64K alignment, in which case the user must ensure there is no mixing of 64K (lmem) and 4K (smem) mappings in the same 2M range. But this is not VM_BIND specific (it will apply to soft-pinning in execbuf2 also).
I don't think we need any VA padding here, as with VM_BIND the VA is managed fully by the user. If we enforce the VA to be 2M aligned, it will leave holes (if BOs are smaller than 2M), but nobody is going to allocate anything from there.
Note that we only apply the 2M alignment + padding for local memory pages, for system memory we don't have/need such restrictions. The VA padding then importantly prevents userspace from incorrectly (or maliciously) inserting 4K system memory object in some page-table operating in 64K GTT mode.
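To make the constraint concrete, here is a minimal userspace-side sketch of what a VA allocator would have to do in VM_BIND mode. This is purely illustrative; the helper names and constants below are made up for the example and are not part of the proposed uapi:

	#include <stdbool.h>
	#include <stdint.h>

	#define SZ_4K	0x1000ull
	#define SZ_64K	0x10000ull
	#define SZ_2M	0x200000ull

	static uint64_t align_up(uint64_t x, uint64_t a)
	{
		return (x + a - 1) & ~(a - 1);	/* a must be a power of two */
	}

	/*
	 * Local-memory (64K page) and system-memory (4K page) PTEs must not
	 * share a 2M page-table on dg2/xehpsdv, so lmem binds get a 2M-aligned
	 * VA and a 2M-padded reservation; smem binds only need 4K.
	 */
	static uint64_t reserve_va(uint64_t hint, uint64_t bo_size, bool is_lmem,
				   uint64_t *reserved_size)
	{
		uint64_t align = is_lmem ? SZ_2M : SZ_4K;
		uint64_t start = align_up(hint, align);

		*reserved_size = align_up(bo_size, align);
		return start;	/* caller binds at [start, start + *reserved_size) */
	}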
Thanks Matt. I also synced offline with Matt a bit on this. We don't need an explicit 'pad_to_size'. The i915 driver implicitly pads the size to a 2M boundary for LMEM BOs, which will apply to VM_BIND also. The remaining question is whether we enforce 2M VA alignment for lmem BOs (just like the legacy execbuff path) on dg2 & xehpsdv, or go with just 64K alignment but ensure there is no mixing of 4K and 64K mappings in the same 2M range.
"Driver is implicitly padding the size to 2MB boundary" - this is the backing store?
Just the GTT space, i.e vma->node.size. Backing store just needs to use 64K pages.
I think we can go with the 2M alignment requirement for VM_BIND also. So, no new requirements here for VM_BIND.
Are there any considerations here of letting userspace know? Presumably the userspace allocator has to know, or it would try to ask for impossible addresses.
It's the existing behaviour with execbuf, so I assume userspace must already get this right, on platforms like dg2.
Regards,
Tvrtko
I will update the documentation.
Niranjana
Niranjana
Regards,
Tvrtko
> > Niranjana > >> Regards, >> >> Tvrtko >> >>> + >>> + /** >>> + * @flags: Supported flags are, >>> + * >>> + * I915_GEM_VM_BIND_READONLY: >>> + * Mapping is read-only. >>> + * >>> + * I915_GEM_VM_BIND_CAPTURE: >>> + * Capture this mapping in the dump upon GPU error. >>> + */ >>> + __u64 flags; >>> +#define I915_GEM_VM_BIND_READONLY (1 << 0) >>> +#define I915_GEM_VM_BIND_CAPTURE (1 << 1) >>> + >>> + /** @extensions: 0-terminated chain of extensions for this >>> mapping. */ >>> + __u64 extensions; >>> +}; >>> + >>> +/** >>> + * struct drm_i915_gem_vm_unbind - VA to object mapping to >>> unbind. >>> + * >>> + * This structure is passed to VM_UNBIND ioctl and specifies >>> the GPU virtual >>> + * address (VA) range that should be unbound from the device >>> page table of the >>> + * specified address space (VM). The specified VA range must >>> match one of the >>> + * mappings created with the VM_BIND ioctl. TLB is flushed >>> upon unbind >>> + * completion. >>> + */ >>> +struct drm_i915_gem_vm_unbind { >>> + /** @vm_id: VM (address space) id to bind */ >>> + __u32 vm_id; >>> + >>> + /** @rsvd: Reserved for future use; must be zero. */ >>> + __u32 rsvd; >>> + >>> + /** @start: Virtual Address start to unbind */ >>> + __u64 start; >>> + >>> + /** @length: Length of mapping to unbind */ >>> + __u64 length; >>> + >>> + /** @flags: reserved for future usage, currently MBZ */ >>> + __u64 flags; >>> + >>> + /** @extensions: 0-terminated chain of extensions for this >>> mapping. */ >>> + __u64 extensions; >>> +}; >>> + >>> +/** >>> + * struct drm_i915_vm_bind_fence - An input or output fence >>> for the vm_bind >>> + * or the vm_unbind work. >>> + * >>> + * The vm_bind or vm_unbind aync worker will wait for input >>> fence to signal >>> + * before starting the binding or unbinding. >>> + * >>> + * The vm_bind or vm_unbind async worker will signal the >>> returned output fence >>> + * after the completion of binding or unbinding. >>> + */ >>> +struct drm_i915_vm_bind_fence { >>> + /** @handle: User's handle for a drm_syncobj to wait on or >>> signal. */ >>> + __u32 handle; >>> + >>> + /** >>> + * @flags: Supported flags are, >>> + * >>> + * I915_VM_BIND_FENCE_WAIT: >>> + * Wait for the input fence before binding/unbinding >>> + * >>> + * I915_VM_BIND_FENCE_SIGNAL: >>> + * Return bind/unbind completion fence as output >>> + */ >>> + __u32 flags; >>> +#define I915_VM_BIND_FENCE_WAIT (1<<0) >>> +#define I915_VM_BIND_FENCE_SIGNAL (1<<1) >>> +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS >>> (-(I915_VM_BIND_FENCE_SIGNAL << 1)) >>> +}; >>> + >>> +/** >>> + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline >>> fences for vm_bind >>> + * and vm_unbind. >>> + * >>> + * This structure describes an array of timeline drm_syncobj >>> and associated >>> + * points for timeline variants of drm_syncobj. These timeline >>> 'drm_syncobj's >>> + * can be input or output fences (See struct >>> drm_i915_vm_bind_fence). >>> + */ >>> +struct drm_i915_vm_bind_ext_timeline_fences { >>> +#define I915_VM_BIND_EXT_timeline_FENCES 0 >>> + /** @base: Extension link. See struct i915_user_extension. */ >>> + struct i915_user_extension base; >>> + >>> + /** >>> + * @fence_count: Number of elements in the @handles_ptr & >>> @value_ptr >>> + * arrays. >>> + */ >>> + __u64 fence_count; >>> + >>> + /** >>> + * @handles_ptr: Pointer to an array of struct >>> drm_i915_vm_bind_fence >>> + * of length @fence_count. 
>>> + */ >>> + __u64 handles_ptr; >>> + >>> + /** >>> + * @values_ptr: Pointer to an array of u64 values of length >>> + * @fence_count. >>> + * Values must be 0 for a binary drm_syncobj. A Value of 0 >>> for a >>> + * timeline drm_syncobj is invalid as it turns a >>> drm_syncobj into a >>> + * binary one. >>> + */ >>> + __u64 values_ptr; >>> +}; >>> + >>> +/** >>> + * struct drm_i915_vm_bind_user_fence - An input or output >>> user fence for the >>> + * vm_bind or the vm_unbind work. >>> + * >>> + * The vm_bind or vm_unbind aync worker will wait for the >>> input fence (value at >>> + * @addr to become equal to @val) before starting the binding >>> or unbinding. >>> + * >>> + * The vm_bind or vm_unbind async worker will signal the >>> output fence after >>> + * the completion of binding or unbinding by writing @val to >>> memory location at >>> + * @addr >>> + */ >>> +struct drm_i915_vm_bind_user_fence { >>> + /** @addr: User/Memory fence qword aligned process virtual >>> address */ >>> + __u64 addr; >>> + >>> + /** @val: User/Memory fence value to be written after bind >>> completion */ >>> + __u64 val; >>> + >>> + /** >>> + * @flags: Supported flags are, >>> + * >>> + * I915_VM_BIND_USER_FENCE_WAIT: >>> + * Wait for the input fence before binding/unbinding >>> + * >>> + * I915_VM_BIND_USER_FENCE_SIGNAL: >>> + * Return bind/unbind completion fence as output >>> + */ >>> + __u32 flags; >>> +#define I915_VM_BIND_USER_FENCE_WAIT (1<<0) >>> +#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1) >>> +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \ >>> + (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1)) >>> +}; >>> + >>> +/** >>> + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences >>> for vm_bind >>> + * and vm_unbind. >>> + * >>> + * These user fences can be input or output fences >>> + * (See struct drm_i915_vm_bind_user_fence). >>> + */ >>> +struct drm_i915_vm_bind_ext_user_fence { >>> +#define I915_VM_BIND_EXT_USER_FENCES 1 >>> + /** @base: Extension link. See struct i915_user_extension. */ >>> + struct i915_user_extension base; >>> + >>> + /** @fence_count: Number of elements in the >>> @user_fence_ptr array. */ >>> + __u64 fence_count; >>> + >>> + /** >>> + * @user_fence_ptr: Pointer to an array of >>> + * struct drm_i915_vm_bind_user_fence of length @fence_count. >>> + */ >>> + __u64 user_fence_ptr; >>> +}; >>> + >>> +/** >>> + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array >>> of batch buffer >>> + * gpu virtual addresses. >>> + * >>> + * In the execbuff ioctl (See struct >>> drm_i915_gem_execbuffer2), this extension >>> + * must always be appended in the VM_BIND mode and it will be >>> an error to >>> + * append this extension in older non-VM_BIND mode. >>> + */ >>> +struct drm_i915_gem_execbuffer_ext_batch_addresses { >>> +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1 >>> + /** @base: Extension link. See struct i915_user_extension. */ >>> + struct i915_user_extension base; >>> + >>> + /** @count: Number of addresses in the addr array. */ >>> + __u32 count; >>> + >>> + /** @addr: An array of batch gpu virtual addresses. */ >>> + __u64 addr[0]; >>> +}; >>> + >>> +/** >>> + * struct drm_i915_gem_execbuffer_ext_user_fence - First level >>> batch completion >>> + * signaling extension. >>> + * >>> + * This extension allows user to attach a user fence (@addr, >>> @value pair) to an >>> + * execbuf to be signaled by the command streamer after the >>> completion of first >>> + * level batch, by writing the @value at specified @addr and >>> triggering an >>> + * interrupt. 
>>> + * User can either poll for this user fence to signal or can >>> also wait on it >>> + * with i915_gem_wait_user_fence ioctl. >>> + * This is very much usefaul for long running contexts where >>> waiting on dma-fence >>> + * by user (like i915_gem_wait ioctl) is not supported. >>> + */ >>> +struct drm_i915_gem_execbuffer_ext_user_fence { >>> +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2 >>> + /** @base: Extension link. See struct i915_user_extension. */ >>> + struct i915_user_extension base; >>> + >>> + /** >>> + * @addr: User/Memory fence qword aligned GPU virtual >>> address. >>> + * >>> + * Address has to be a valid GPU virtual address at the >>> time of >>> + * first level batch completion. >>> + */ >>> + __u64 addr; >>> + >>> + /** >>> + * @value: User/Memory fence Value to be written to above >>> address >>> + * after first level batch completes. >>> + */ >>> + __u64 value; >>> + >>> + /** @rsvd: Reserved for future extensions, MBZ */ >>> + __u64 rsvd; >>> +}; >>> + >>> +/** >>> + * struct drm_i915_gem_create_ext_vm_private - Extension to >>> make the object >>> + * private to the specified VM. >>> + * >>> + * See struct drm_i915_gem_create_ext. >>> + */ >>> +struct drm_i915_gem_create_ext_vm_private { >>> +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2 >>> + /** @base: Extension link. See struct i915_user_extension. */ >>> + struct i915_user_extension base; >>> + >>> + /** @vm_id: Id of the VM to which the object is private */ >>> + __u32 vm_id; >>> +}; >>> + >>> +/** >>> + * struct drm_i915_gem_wait_user_fence - Wait on user/memory >>> fence. >>> + * >>> + * User/Memory fence can be woken up either by: >>> + * >>> + * 1. GPU context indicated by @ctx_id, or, >>> + * 2. Kerrnel driver async worker upon I915_UFENCE_WAIT_SOFT. >>> + * @ctx_id is ignored when this flag is set. >>> + * >>> + * Wakeup condition is, >>> + * ``((*addr & mask) op (value & mask))`` >>> + * >>> + * See :ref:`Documentation/driver-api/dma-buf.rst >>> <indefinite_dma_fences>` >>> + */ >>> +struct drm_i915_gem_wait_user_fence { >>> + /** @extensions: Zero-terminated chain of extensions. */ >>> + __u64 extensions; >>> + >>> + /** @addr: User/Memory fence address */ >>> + __u64 addr; >>> + >>> + /** @ctx_id: Id of the Context which will signal the >>> fence. */ >>> + __u32 ctx_id; >>> + >>> + /** @op: Wakeup condition operator */ >>> + __u16 op; >>> +#define I915_UFENCE_WAIT_EQ 0 >>> +#define I915_UFENCE_WAIT_NEQ 1 >>> +#define I915_UFENCE_WAIT_GT 2 >>> +#define I915_UFENCE_WAIT_GTE 3 >>> +#define I915_UFENCE_WAIT_LT 4 >>> +#define I915_UFENCE_WAIT_LTE 5 >>> +#define I915_UFENCE_WAIT_BEFORE 6 >>> +#define I915_UFENCE_WAIT_AFTER 7 >>> + >>> + /** >>> + * @flags: Supported flags are, >>> + * >>> + * I915_UFENCE_WAIT_SOFT: >>> + * >>> + * To be woken up by i915 driver async worker (not by GPU). >>> + * >>> + * I915_UFENCE_WAIT_ABSTIME: >>> + * >>> + * Wait timeout specified as absolute time. >>> + */ >>> + __u16 flags; >>> +#define I915_UFENCE_WAIT_SOFT 0x1 >>> +#define I915_UFENCE_WAIT_ABSTIME 0x2 >>> + >>> + /** @value: Wakeup value */ >>> + __u64 value; >>> + >>> + /** @mask: Wakeup mask */ >>> + __u64 mask; >>> +#define I915_UFENCE_WAIT_U8 0xffu >>> +#define I915_UFENCE_WAIT_U16 0xffffu >>> +#define I915_UFENCE_WAIT_U32 0xfffffffful >>> +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull >>> + >>> + /** >>> + * @timeout: Wait timeout in nanoseconds. >>> + * >>> + * If I915_UFENCE_WAIT_ABSTIME flag is set, then time >>> timeout is the >>> + * absolute time in nsec. >>> + */ >>> + __s64 timeout; >>> +};
On Tue, May 17, 2022 at 11:32:12AM -0700, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
1 file changed, 399 insertions(+)
create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
new file mode 100644
index 000000000000..589c0a009107
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.h
@@ -0,0 +1,399 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0)
+/**
- DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
- Flag to declare context as long running.
- See struct drm_i915_gem_context_create_ext flags.
- Usage of dma-fence expects that they complete in reasonable amount of time.
- Compute on the other hand can be long running. Hence it is not appropriate
- for compute contexts to export request completion dma-fence to user.
- The dma-fence usage will be limited to in-kernel consumption only.
- Compute contexts need to use user/memory fence.
- So, long running contexts do not support output fences. Hence,
- I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
- I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
- to be not used.
- DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
- to long running contexts.
- */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
+/* VM_BIND related ioctls */
+#define DRM_I915_GEM_VM_BIND		0x3d
+#define DRM_I915_GEM_VM_UNBIND		0x3e
+#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
+#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+/**
- struct drm_i915_gem_vm_bind - VA to object mapping to bind.
- This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
- virtual address (VA) range to the section of an object that should be bound
- in the device page table of the specified address space (VM).
- The VA range specified must be unique (ie., not currently bound) and can
- be mapped to whole object or a section of the object (partial binding).
- Multiple VA mappings can be created to the same section of the object
- (aliasing).
- */
+struct drm_i915_gem_vm_bind {
- /** @vm_id: VM (address space) id to bind */
- __u32 vm_id;
- /** @handle: Object handle */
- __u32 handle;
- /** @start: Virtual Address start to bind */
- __u64 start;
- /** @offset: Offset in object to bind */
- __u64 offset;
- /** @length: Length of mapping to bind */
- __u64 length;
- /**
* @flags: Supported flags are,
*
* I915_GEM_VM_BIND_READONLY:
* Mapping is read-only.
*
* I915_GEM_VM_BIND_CAPTURE:
* Capture this mapping in the dump upon GPU error.
*/
- __u64 flags;
+#define I915_GEM_VM_BIND_READONLY (1 << 0)
+#define I915_GEM_VM_BIND_CAPTURE (1 << 1)
- /** @extensions: 0-terminated chain of extensions for this mapping. */
- __u64 extensions;
+};
+/**
- struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
- This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
- address (VA) range that should be unbound from the device page table of the
- specified address space (VM). The specified VA range must match one of the
- mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
- completion.
- */
+struct drm_i915_gem_vm_unbind {
- /** @vm_id: VM (address space) id to bind */
- __u32 vm_id;
- /** @rsvd: Reserved for future use; must be zero. */
- __u32 rsvd;
- /** @start: Virtual Address start to unbind */
- __u64 start;
- /** @length: Length of mapping to unbind */
- __u64 length;
This probably isn't needed. We are never going to unbind a subset of a VMA, are we? That being said, it can't hurt as a sanity check (e.g. internal vma->length == user unbind length).
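For example, something along these lines in the unbind path (a sketch only; the lookup helper and field names are hypothetical, not actual i915 internals):

	/* Reject an unbind whose range does not exactly match an existing mapping. */
	vma = lookup_persistent_vma(vm, args->start);	/* hypothetical lookup */
	if (!vma || vma->length != args->length)
		return -ENOENT;	/* not an exact, existing mapping */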
- /** @flags: reserved for future usage, currently MBZ */
- __u64 flags;
- /** @extensions: 0-terminated chain of extensions for this mapping. */
- __u64 extensions;
+};
+/**
- struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
- or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence to signal
- before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the returned output fence
- after the completion of binding or unbinding.
- */
+struct drm_i915_vm_bind_fence {
- /** @handle: User's handle for a drm_syncobj to wait on or signal. */
- __u32 handle;
- /**
* @flags: Supported flags are,
*
* I915_VM_BIND_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
- __u32 flags;
+#define I915_VM_BIND_FENCE_WAIT (1<<0)
+#define I915_VM_BIND_FENCE_SIGNAL (1<<1)
+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1))
+};
+/**
- struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
- and vm_unbind.
- This structure describes an array of timeline drm_syncobj and associated
- points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
- can be input or output fences (See struct drm_i915_vm_bind_fence).
- */
+struct drm_i915_vm_bind_ext_timeline_fences {
+#define I915_VM_BIND_EXT_timeline_FENCES 0
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /**
 * @fence_count: Number of elements in the @handles_ptr & @values_ptr
* arrays.
*/
- __u64 fence_count;
- /**
* @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
* of length @fence_count.
*/
- __u64 handles_ptr;
- /**
* @values_ptr: Pointer to an array of u64 values of length
* @fence_count.
* Values must be 0 for a binary drm_syncobj. A Value of 0 for a
* timeline drm_syncobj is invalid as it turns a drm_syncobj into a
* binary one.
*/
- __u64 values_ptr;
+};
+/**
- struct drm_i915_vm_bind_user_fence - An input or output user fence for the
- vm_bind or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence (value at
- @addr to become equal to @val) before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the output fence after
- the completion of binding or unbinding by writing @val to memory location at
- @addr
- */
+struct drm_i915_vm_bind_user_fence {
- /** @addr: User/Memory fence qword aligned process virtual address */
- __u64 addr;
- /** @val: User/Memory fence value to be written after bind completion */
- __u64 val;
- /**
* @flags: Supported flags are,
*
* I915_VM_BIND_USER_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_USER_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
- __u32 flags;
+#define I915_VM_BIND_USER_FENCE_WAIT (1<<0)
+#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1)
+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
+	(-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
+};
+/**
- struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
- and vm_unbind.
- These user fences can be input or output fences
- (See struct drm_i915_vm_bind_user_fence).
- */
+struct drm_i915_vm_bind_ext_user_fence {
+#define I915_VM_BIND_EXT_USER_FENCES 1
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @fence_count: Number of elements in the @user_fence_ptr array. */
- __u64 fence_count;
- /**
* @user_fence_ptr: Pointer to an array of
* struct drm_i915_vm_bind_user_fence of length @fence_count.
*/
- __u64 user_fence_ptr;
+};
IMO all of these fence structs should be a generic sync interface shared between both vm bind and exec3 rather than unique extensions.
Both vm bind and exec3 should have something like this:
__u64 syncs; /* userptr to an array of generic syncs */ __u64 n_syncs;
Having an array of syncs lets the kernel do one user copy for all the syncs rather than reading them in a chain.
A generic sync object encapsulates all possible syncs (in / out - syncobj, syncobj timeline, ufence, future sync concepts).
e.g.
struct {
	__u32 user_ext;
	__u32 flag;	/* in / out, type, whatever else info we need */
	union {
		__u32 handle;	/* to syncobj */
		__u64 addr;	/* ufence address */
	};
	__u64 seqno;	/* syncobj timeline, ufence write value */
	...reserve enough bits for future...
}
This unifies binds and execs by using the same sync interface, instilling the concept that binds and execs are the same op (a queued operation with in/out fences).
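Purely as an illustration of how that could look from userspace (every name below is made up for this sketch; nothing here is proposed uapi):

	/* Hypothetical flag bits for the generic sync struct sketched above. */
	#define EXAMPLE_SYNC_IN		(1u << 0)
	#define EXAMPLE_SYNC_OUT	(1u << 1)
	#define EXAMPLE_SYNC_SYNCOBJ	(1u << 2)
	#define EXAMPLE_SYNC_UFENCE	(1u << 3)

	struct example_sync {		/* mirrors the sketch above */
		__u32 user_ext;
		__u32 flag;
		union {
			__u32 handle;	/* drm_syncobj handle */
			__u64 addr;	/* ufence address */
		};
		__u64 seqno;		/* timeline point / ufence write value */
	};

	/* wait_syncobj and ufence_addr are placeholders obtained elsewhere.
	 * One wait syncobj in, one ufence signalled out, in a single array:
	 */
	struct example_sync syncs[2] = {
		{ .flag = EXAMPLE_SYNC_IN  | EXAMPLE_SYNC_SYNCOBJ, .handle = wait_syncobj },
		{ .flag = EXAMPLE_SYNC_OUT | EXAMPLE_SYNC_UFENCE,  .addr = ufence_addr, .seqno = 1 },
	};

	vm_bind.syncs   = (__u64)(uintptr_t)syncs;	/* one copy_from_user for everything */
	vm_bind.n_syncs = 2;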
Matt
+/**
- struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
- gpu virtual addresses.
- In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
- must always be appended in the VM_BIND mode and it will be an error to
- append this extension in older non-VM_BIND mode.
- */
+struct drm_i915_gem_execbuffer_ext_batch_addresses {
+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @count: Number of addresses in the addr array. */
- __u32 count;
- /** @addr: An array of batch gpu virtual addresses. */
- __u64 addr[0];
+};
+/**
- struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
- signaling extension.
- This extension allows user to attach a user fence (@addr, @value pair) to an
- execbuf to be signaled by the command streamer after the completion of first
- level batch, by writing the @value at specified @addr and triggering an
- interrupt.
- User can either poll for this user fence to signal or can also wait on it
- with i915_gem_wait_user_fence ioctl.
- This is very useful for long running contexts where waiting on a dma-fence
- by user (like i915_gem_wait ioctl) is not supported.
- */
+struct drm_i915_gem_execbuffer_ext_user_fence {
+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /**
* @addr: User/Memory fence qword aligned GPU virtual address.
*
* Address has to be a valid GPU virtual address at the time of
* first level batch completion.
*/
- __u64 addr;
- /**
* @value: User/Memory fence Value to be written to above address
* after first level batch completes.
*/
- __u64 value;
- /** @rsvd: Reserved for future extensions, MBZ */
- __u64 rsvd;
+};
+/**
- struct drm_i915_gem_create_ext_vm_private - Extension to make the object
- private to the specified VM.
- See struct drm_i915_gem_create_ext.
- */
+struct drm_i915_gem_create_ext_vm_private {
+#define I915_GEM_CREATE_EXT_VM_PRIVATE 2
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @vm_id: Id of the VM to which the object is private */
- __u32 vm_id;
+};
+/**
- struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
- User/Memory fence can be woken up either by:
- GPU context indicated by @ctx_id, or,
- Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
- @ctx_id is ignored when this flag is set.
- Wakeup condition is,
- ``((*addr & mask) op (value & mask))``
- See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
- */
+struct drm_i915_gem_wait_user_fence {
- /** @extensions: Zero-terminated chain of extensions. */
- __u64 extensions;
- /** @addr: User/Memory fence address */
- __u64 addr;
- /** @ctx_id: Id of the Context which will signal the fence. */
- __u32 ctx_id;
- /** @op: Wakeup condition operator */
- __u16 op;
+#define I915_UFENCE_WAIT_EQ 0
+#define I915_UFENCE_WAIT_NEQ 1
+#define I915_UFENCE_WAIT_GT 2
+#define I915_UFENCE_WAIT_GTE 3
+#define I915_UFENCE_WAIT_LT 4
+#define I915_UFENCE_WAIT_LTE 5
+#define I915_UFENCE_WAIT_BEFORE 6
+#define I915_UFENCE_WAIT_AFTER 7
- /**
* @flags: Supported flags are,
*
* I915_UFENCE_WAIT_SOFT:
*
* To be woken up by i915 driver async worker (not by GPU).
*
* I915_UFENCE_WAIT_ABSTIME:
*
* Wait timeout specified as absolute time.
*/
- __u16 flags;
+#define I915_UFENCE_WAIT_SOFT 0x1
+#define I915_UFENCE_WAIT_ABSTIME 0x2
- /** @value: Wakeup value */
- __u64 value;
- /** @mask: Wakeup mask */
- __u64 mask;
+#define I915_UFENCE_WAIT_U8 0xffu
+#define I915_UFENCE_WAIT_U16 0xffffu
+#define I915_UFENCE_WAIT_U32 0xfffffffful
+#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull
- /**
* @timeout: Wait timeout in nanoseconds.
*
 * If I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is the
* absolute time in nsec.
*/
- __s64 timeout;
+};
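As a rough note on the wakeup condition documented above, the semantics amount to something like the following userspace-visible check (illustrative only; a real waiter would block in the ioctl rather than spin, and addr/value/mask are placeholders):

	/* Woken once ((*addr & mask) op (value & mask)) holds; e.g. with
	 * op = I915_UFENCE_WAIT_GTE and mask = I915_UFENCE_WAIT_U64:
	 */
	while (!((*(volatile __u64 *)addr & mask) >= (value & mask)))
		;	/* or block in DRM_IOCTL_I915_GEM_WAIT_USER_FENCE */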
2.21.0.rc0.32.g243a4c7e27