This is the i915 driver VM_BIND feature design RFC patch series, along with the required uapi definition and a description of the intended use cases.
v2: Updated design and uapi, more documentation.
v3: Add more documentation and proper kernel-doc formatting with cross
    references (including missing i915_drm uapi kernel-docs which are
    required), as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Niranjana Vishwanathapura (3):
  drm/doc/rfc: VM_BIND feature design document
  drm/i915: Update i915 uapi documentation
  drm/doc/rfc: VM_BIND uapi definition
 Documentation/driver-api/dma-buf.rst   |   2 +
 Documentation/gpu/rfc/i915_vm_bind.h   | 399 +++++++++++++++++++++++++
 Documentation/gpu/rfc/i915_vm_bind.rst | 304 +++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 include/uapi/drm/i915_drm.h            | 153 +++++++---
 5 files changed, 825 insertions(+), 37 deletions(-)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
VM_BIND design document with description of intended use cases.
v2: Add more documentation and format as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 Documentation/driver-api/dma-buf.rst   |   2 +
 Documentation/gpu/rfc/i915_vm_bind.rst | 304 +++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 3 files changed, 310 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 36a76cbe9095..64cb924ec5bb 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -200,6 +200,8 @@ DMA Fence uABI/Sync File
 .. kernel-doc:: include/linux/sync_file.h
    :internal:
+.. _indefinite_dma_fences:
+
 Indefinite DMA Fences
 ~~~~~~~~~~~~~~~~~~~~~
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
new file mode 100644
index 000000000000..f1be560d313c
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
@@ -0,0 +1,304 @@
+==========================================
+I915 VM_BIND feature design and use cases
+==========================================
+
+VM_BIND feature
+================
+DRM_I915_GEM_VM_BIND/UNBIND ioctls allow the UMD to bind/unbind GEM buffer
+objects (BOs), or sections of a BO, at specified GPU virtual addresses on a
+specified address space (VM). These mappings (also referred to as persistent
+mappings) will be persistent across multiple GPU submissions (execbuff calls)
+issued by the UMD, without the user having to provide a list of all required
+mappings during each submission (as required by the older execbuff mode).
+
+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userspace
+to specify how the binding/unbinding should sync with other operations
+like the GPU job submission. These fences will be timeline 'drm_syncobj's
+for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences).
+For Compute contexts, they will be user/memory fences (See struct
+drm_i915_vm_bind_ext_user_fence).
+
+The VM_BIND feature is advertised to the user via I915_PARAM_HAS_VM_BIND.
+The user has to opt in to the VM_BIND mode of binding for an address space
+(VM) at VM creation time via the I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
+
+The VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping
+in an async worker. The binding and unbinding will work like a special GPU
+engine. The binding and unbinding operations are serialized and will wait on
+specified input fences before the operation and will signal the output fences
+upon the completion of the operation. Due to serialization, completion of an
+operation will also indicate that all previous operations are complete.
+
+VM_BIND features include:
+
+* Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+* A VA mapping can map to a partial section of the BO (partial binding).
+* Support capture of persistent mappings in the dump upon GPU error.
+* The TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  use cases will be helpful.
+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
+* Support for userptr gem objects (no special uapi is required for this).
+
+Execbuff ioctl in VM_BIND mode
+-------------------------------
+The execbuff ioctl handling in VM_BIND mode differs significantly from the
+older method. A VM in VM_BIND mode will not support the older execbuff mode
+of binding. In VM_BIND mode, the execbuff ioctl will not accept any execlist.
+Hence, there is no support for implicit sync. It is expected that the below
+work will be able to support the requirements of object dependency setting in
+all use cases:
+
+"dma-buf: Add an API for exporting sync files"
+(https://lwn.net/Articles/859290/)
+
+This also means we need an execbuff extension to pass in the batch
+buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+
+If execlist support in the execbuff ioctl is deemed necessary for
+implicit sync in certain use cases, then support can be added later.
+
+In VM_BIND mode, VA allocation is completely managed by the user instead of
+the i915 driver. Hence VA assignment and eviction are not applicable in
+VM_BIND mode.
+Also, for determining object activeness, VM_BIND mode will not
+be using the i915_vma active reference tracking. It will instead use the
+dma-resv object for that (See `VM_BIND dma_resv usage`_).
+
+So, a lot of existing code in the execbuff path, like relocations, VA
+evictions, the vma lookup table, implicit sync, vma active reference tracking
+etc., is not applicable in VM_BIND mode. Hence, the execbuff path needs to be
+cleaned up by clearly separating out the functionalities where the VM_BIND
+mode differs from the older method, and they should be moved to separate
+files.
+
+VM_PRIVATE objects
+-------------------
+By default, BOs can be mapped on multiple VMs and can also be dma-buf
+exported. Hence these BOs are referred to as Shared BOs.
+During each execbuff submission, the request fence must be added to the
+dma-resv fence list of all shared BOs mapped on the VM.
+
+The VM_BIND feature introduces an optimization where the user can create a BO
+which is private to a specified VM via the I915_GEM_CREATE_EXT_VM_PRIVATE
+flag during BO creation. Unlike Shared BOs, these VM private BOs can only be
+mapped on the VM they are private to and can't be dma-buf exported.
+All private BOs of a VM share the dma-resv object. Hence during each execbuff
+submission, they need only one dma-resv fence list update. Thus, the fast
+path (where required mappings are already bound) submission latency is O(1)
+w.r.t the number of VM private BOs.
+
+VM_BIND locking hierarchy
+--------------------------
+The locking design here supports the older (execlist based) execbuff mode, the
+newer VM_BIND mode, the VM_BIND mode with GPU page faults and possible future
+system allocator support (See `Shared Virtual Memory (SVM) support`_).
+The older execbuff mode and the newer VM_BIND mode without page faults manage
+residency of backing storage using dma_fence. The VM_BIND mode with page
+faults and the system allocator support do not use any dma_fence at all.
+
+The VM_BIND locking order is as below.
+
+1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
+   vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the
+   mapping.
+
+   In future, when GPU page faults are supported, we can potentially use a
+   rwsem instead, so that multiple page fault handlers can take the read side
+   lock to look up the mapping and hence can run in parallel.
+   The older execbuff mode of binding does not need this lock.
+
+2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs to
+   be held while binding/unbinding a vma in the async worker and while
+   updating the dma-resv fence list of an object. Note that private BOs of a
+   VM will all share a dma-resv object.
+
+   The future system allocator support will use the HMM prescribed locking
+   instead.
+
+3) Lock-C: Spinlock/s to protect some of the VM's lists, like the list of
+   invalidated vmas (due to eviction and userptr invalidation) etc.
+
+When GPU page faults are supported, the execbuff path does not take any of
+these locks. There we will simply smash the new batch buffer address into the
+ring and then tell the scheduler to run that. The lock taking only happens
+from the page fault handler, where we take lock-A in read mode, whichever
+lock-B we need to find the backing storage (dma_resv lock for gem objects,
+and hmm/core mm for the system allocator) and some additional locks (lock-D)
+for taking care of page table races. Page fault mode should not need to ever
+manipulate the vm lists, so it won't ever need lock-C.
+
+VM_BIND LRU handling
+---------------------
+We need to ensure that VM_BIND mapped objects are properly LRU tagged to avoid
+performance degradation. We will also need support for bulk LRU movement of
+VM_BIND objects to avoid additional latencies in the execbuff path.
+
+The page table pages are similar to VM_BIND mapped objects (See
+`Evictable page table allocations`_); they are maintained per VM and need to
+be pinned in memory when the VM is made active (i.e., upon an execbuff call
+with that VM). So, bulk LRU movement of page table pages is also needed.
+
+The i915 shrinker LRU has stopped being an LRU. So, it should also be moved
+over to the ttm LRU in some fashion to make sure we once again have a
+reasonable and consistent memory aging and reclaim architecture.
+
+VM_BIND dma_resv usage
+-----------------------
+Fences need to be added to all VM_BIND mapped objects. During each execbuff
+submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent
+over-sync (See enum dma_resv_usage). One can override it with either
+DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object dependency
+setting (either through an explicit or an implicit mechanism).
+
+When vm_bind is called for a non-private object while the VM is already
+active, the fences need to be copied from the VM's shared dma-resv object
+(common to all private objects of the VM) to this non-private object.
+If this results in performance degradation, then some optimization will
+be needed here. This is not a problem for the VM's private objects as they
+use the shared dma-resv object which is always updated on each execbuff
+submission.
+
+Also, in VM_BIND mode, use the dma-resv apis for determining object activeness
+(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not use the
+older i915_vma active reference tracking, which is deprecated. This should be
+easier to get working with the current TTM backend. We can remove the
+i915_vma active reference tracking fully while supporting the TTM backend for
+igfx.
+
+Evictable page table allocations
+---------------------------------
+Make page table allocations evictable and manage them similar to VM_BIND
+mapped objects. Page table pages are similar to persistent mappings of a
+VM (the differences here are that the page table pages will not have an
+i915_vma structure and that, after swapping pages back in, the parent page
+link needs to be updated).
+
+Mesa use case
+--------------
+VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and
+Iris), hence improving performance of CPU-bound applications. It also allows
+us to implement Vulkan's Sparse Resources. With increasing GPU hardware
+performance, reducing CPU overhead becomes more impactful.
+
+
+VM_BIND Compute support
+========================
+
+User/Memory Fence
+------------------
+The idea is to take a user specified virtual address and install an interrupt
+handler to wake up the current task when the memory location passes the user
+supplied filter. A User/Memory fence is an <address, value> pair. To signal
+the user fence, the specified value will be written at the specified virtual
+address and the waiting process will be woken up. The user can wait on a user
+fence with the gem_wait_user_fence ioctl.
+
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
+interrupt within their batches after updating the value, to have sub-batch
+precision on the wakeup. Each batch can signal a user fence to indicate
+the completion of the next level batch.
+The completion of the very first
+level batch needs to be signaled by the command streamer. The user must
+provide the user/memory fence for this via the
+DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE extension of the execbuff ioctl, so
+that the KMD can set up the command streamer to signal it.
+
+A User/Memory fence can also be supplied to the kernel driver to signal/wake
+up the user process after completion of an asynchronous operation.
+
+When the VM_BIND ioctl is provided with a user/memory fence via the
+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
+of binding of that mapping. All async binds/unbinds are serialized, hence
+signaling of the user/memory fence also indicates the completion of all
+previous binds/unbinds.
+
+This feature will be derived from the below original work:
+https://patchwork.freedesktop.org/patch/349417/
+
+Long running Compute contexts
+------------------------------
+Usage of dma-fence expects that it completes in a reasonable amount of time.
+Compute, on the other hand, can be long running. Hence it is appropriate for
+compute to use user/memory fences, and dma-fence usage will be limited to
+in-kernel consumption only. This requires an execbuff uapi extension to pass
+in a user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must
+opt in to this mechanism with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag
+during context creation. The dma-fence based user interfaces like the
+gem_wait ioctl and the execbuff out fence are not allowed on long running
+contexts. Implicit sync is not valid either and is anyway not supported in
+VM_BIND mode.
+
+Where GPU page faults are not available, the kernel driver, upon buffer
+invalidation, will initiate a suspend (preemption) of the long running
+context with a dma-fence attached to it. And upon completion of that suspend
+fence, it will finish the invalidation, revalidate the BO and then resume the
+compute context. This is done by having a per-context preempt fence (also
+called suspend fence) proxying as the i915_request fence. This suspend fence
+is enabled when someone tries to wait on it, which then triggers the context
+preemption.
+
+As this support for context suspension using a preempt fence and the resume
+work for the compute mode contexts can be tricky to get right, it is better
+to add this support in the drm scheduler so that multiple drivers can make
+use of it. That means it will have a dependency on the i915 drm scheduler
+conversion with the GuC scheduler backend. This should be fine, as the plan
+is to support compute mode contexts only with the GuC scheduler backend (at
+least initially). This is much easier to support with VM_BIND mode compared
+to the current heavier execbuff path resource attachment.
+
+Low Latency Submission
+-----------------------
+Allows the compute UMD to directly submit GPU jobs instead of going through
+the execbuff ioctl. This is made possible because VM_BIND is not synchronized
+against execbuff. VM_BIND allows bind/unbind of the mappings required for the
+directly submitted jobs.
+
+Other VM_BIND use cases
+========================
+
+Debugger
+---------
+With the debug event interface, a user space process (debugger) is able to
+keep track of and act upon resources created by another process (the debugged
+process) and attached to the GPU via the vm_bind interface.
+
+GPU page faults
+----------------
+GPU page faults, when supported (in future), will only be supported in
+VM_BIND mode.
+While both the older execbuff mode and the newer VM_BIND mode of
+binding will require using dma-fence to ensure residency, the GPU page faults
+mode, when supported, will not use any dma-fence, as residency is purely
+managed by installing and removing/invalidating page table entries.
+
+Page level hints settings
+--------------------------
+VM_BIND allows any hints setting per mapping instead of per BO.
+Possible hints include read-only mapping, placement and atomicity.
+Sub-BO level placement hints will be even more relevant with
+upcoming GPU on-demand page fault support.
+
+Page level Cache/CLOS settings
+-------------------------------
+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+
+Shared Virtual Memory (SVM) support
+------------------------------------
+The VM_BIND interface can be used to map system memory directly (without the
+gem BO abstraction) using the HMM interface. SVM is only supported with GPU
+page faults enabled.
+
+
+Broader i915 cleanups
+=====================
+Supporting this whole new vm_bind mode of binding, which comes with its own
+use cases to support and its locking requirements, requires proper
+integration with the existing i915 driver. This calls for some broader i915
+driver cleanups/simplifications for maintainability of the driver going
+forward. Here are a few things that have been identified and are being looked
+into.
+
+- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
+  feature does not use it, and the complexity it brings in is probably more
+  than the performance advantage we get in the legacy execbuff case.
+- Remove vma->open_count counting.
+- Remove i915_vma active reference tracking. The VM_BIND feature will not be
+  using it. Instead, use the underlying BO's dma-resv fence list to determine
+  whether an i915_vma is active or not.
+
+
+VM_BIND UAPI
+=============
+
+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..7d10c36b268d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
     i915_scheduler.rst
+
+.. toctree::
+
+    i915_vm_bind.rst
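For readers new to the proposed interface, a rough userspace sketch of the opt-in flow described in the document is shown below. The VM_BIND symbols (I915_PARAM_HAS_VM_BIND, I915_VM_CREATE_FLAGS_USE_VM_BIND, struct drm_i915_gem_vm_bind, DRM_IOCTL_I915_GEM_VM_BIND) are taken from the RFC header in this series and are not final uapi, so treat this as an illustration of the intended flow under those assumptions rather than working code; only the getparam, vm_create and create_ext pieces are existing i915 uapi.

/*
 * Sketch only. The VM_BIND names below come from the RFC header
 * (Documentation/gpu/rfc/i915_vm_bind.h); struct layouts and ioctl
 * numbers may change before anything is merged.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>
#include "i915_vm_bind.h"            /* RFC uapi draft */

static int vm_bind_example(int drm_fd, uint32_t bo_handle,
                           uint64_t gpu_va, uint64_t size)
{
    int has_vm_bind = 0;
    struct drm_i915_getparam gp = { .param = I915_PARAM_HAS_VM_BIND };

    gp.value = &has_vm_bind;
    if (ioctl(drm_fd, DRM_IOCTL_I915_GETPARAM, &gp) || !has_vm_bind)
        return -1;                   /* kernel does not advertise VM_BIND */

    /* Opt in to VM_BIND mode at VM creation time. */
    struct drm_i915_gem_vm_control vm_create = {
        .flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,      /* RFC flag */
    };
    if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm_create))
        return -1;

    /* Bind the whole BO at a UMD-chosen GPU virtual address. */
    struct drm_i915_gem_vm_bind bind = {                /* RFC struct */
        .vm_id  = vm_create.vm_id,
        .handle = bo_handle,
        .start  = gpu_va,            /* VA is managed entirely by the UMD */
        .offset = 0,                 /* non-zero for partial binds */
        .length = size,
    };
    return ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);
}

The corresponding unbind would pass the same VA range to the VM_UNBIND ioctl, and in/out fences would be chained through the extensions field as described above.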
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND design document with description of intended use cases.
v2: Add more documentation and format as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst new file mode 100644 index 000000000000..f1be560d313c --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.rst @@ -0,0 +1,304 @@ +========================================== +I915 VM_BIND feature design and use cases +==========================================
+VM_BIND feature +================ +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a +specified address space (VM). These mappings (also referred to as persistent +mappings) will be persistent across multiple GPU submissions (execbuff calls) +issued by the UMD, without user having to provide a list of all required +mappings during each submission (as required by older execbuff mode).
+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace +to specify how the binding/unbinding should sync with other operations +like the GPU job submission. These fences will be timeline 'drm_syncobj's +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences). +For Compute contexts, they will be user/memory fences (See struct +drm_i915_vm_bind_ext_user_fence).
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND. +User has to opt-in for VM_BIND mode of binding for an address space (VM) +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
+VM_BIND features include:
+* Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+* VA mapping can map to a partial section of the BO (partial binding).
+* Support capture of persistent mappings in the dump upon GPU error.
+* TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  use cases will be helpful.
+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
+* Support for userptr gem objects (no special uapi is required for this).
+Execbuff ioctl in VM_BIND mode +------------------------------- +The execbuff ioctl handling in VM_BIND mode differs significantly from the +older method. A VM in VM_BIND mode will not support older execbuff mode of +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence, +no support for implicit sync. It is expected that the below work will be able +to support requirements of object dependency setting in all use cases:
+"dma-buf: Add an API for exporting sync files" +(https://lwn.net/Articles/859290/)
I would really like to have more details here. The link provided points to new ioctls and we're not very familiar with those yet, so I think you should really clarify the interaction between the new additions here. Having some sample code would be really nice too.
For Mesa at least (and I believe for the other drivers too) we always have a few exported buffers in every execbuf call, and we rely on the implicit synchronization provided by execbuf to make sure everything works. The execbuf ioctl also has some code to flush caches during implicit synchronization AFAIR, so I would guess we rely on it too and whatever else the Kernel does. Is that covered by the new ioctls?
In addition, as far as I remember, one of the big improvements of vm_bind was that it would help reduce ioctl latency and cpu overhead. But if making execbuf faster comes at the cost of requiring additional ioctl calls for implicit synchronization, which is required on every execbuf call, then I wonder if we'll even get any faster at all. Comparing old execbuf vs plain new execbuf without the new required ioctls won't make sense.
But maybe I'm wrong and we won't need to call these new ioctls around every single execbuf ioctl we submit? Again, more clarification and some code examples here would be really nice. This is a big change to an important part of the API, so we should clarify the new expected usage.
+This also means, we need an execbuff extension to pass in the batch +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+If at all execlist support in execbuff ioctl is deemed necessary for +implicit sync in certain use cases, then support can be added later.
IMHO we really need to sort this out and check all the assumptions before we commit to any interface. Again, implicit synchronization is something we rely on during *every* execbuf ioctl for most workloads.
+In VM_BIND mode, VA allocation is completely managed by the user instead of +the i915 driver. Hence all VA assignment, eviction are not applicable in +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not +be using the i915_vma active reference tracking. It will instead use dma-resv +object for that (See `VM_BIND dma_resv usage`_).
+So, a lot of existing code in the execbuff path like relocations, VA evictions, +vma lookup table, implicit sync, vma active reference tracking etc., are not +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up +by clearly separating out the functionalities where the VM_BIND mode differs +from older method and they should be moved to separate files.
I seem to recall some conversations where we were told a bunch of ioctls would stop working or make no sense to call when using vm_bind. Can we please get a complete list of those? Bonus points if the Kernel starts telling us we just called something that makes no sense.
+VM_PRIVATE objects +------------------- +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO which +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on +the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each execbuff +submission, they need only one dma-resv fence list updated. Thus, the fast +path (where required mappings are already bound) submission latency is O(1) +w.r.t the number of VM private BOs.
I know we already discussed this, but just to document it publicly: the ideal case for user space would be that every BO is created as private but then we'd have an ioctl to convert it to non-private (without the need to have a non-private->private interface).
An explanation on why we can't have an ioctl to mark as exported a buffer that was previously vm_private would be really appreciated.
Thanks, Paulo
On Thu, May 19, 2022 at 03:52:01PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND design document with description of intended use cases.
v2: Add more documentation and format as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst new file mode 100644 index 000000000000..f1be560d313c --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.rst @@ -0,0 +1,304 @@ +========================================== +I915 VM_BIND feature design and use cases +==========================================
+VM_BIND feature +================ +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a +specified address space (VM). These mappings (also referred to as persistent +mappings) will be persistent across multiple GPU submissions (execbuff calls) +issued by the UMD, without user having to provide a list of all required +mappings during each submission (as required by older execbuff mode).
+VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace +to specify how the binding/unbinding should sync with other operations +like the GPU job submission. These fences will be timeline 'drm_syncobj's +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences). +For Compute contexts, they will be user/memory fences (See struct +drm_i915_vm_bind_ext_user_fence).
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND. +User has to opt-in for VM_BIND mode of binding for an address space (VM) +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
+VM_BIND features include:
+* Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+* VA mapping can map to a partial section of the BO (partial binding).
+* Support capture of persistent mappings in the dump upon GPU error.
+* TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  use cases will be helpful.
+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences.
+* Support for userptr gem objects (no special uapi is required for this).
+Execbuff ioctl in VM_BIND mode +------------------------------- +The execbuff ioctl handling in VM_BIND mode differs significantly from the +older method. A VM in VM_BIND mode will not support older execbuff mode of +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence, +no support for implicit sync. It is expected that the below work will be able +to support requirements of object dependency setting in all use cases:
+"dma-buf: Add an API for exporting sync files" +(https://lwn.net/Articles/859290/)
I would really like to have more details here. The link provided points to new ioctls and we're not very familiar with those yet, so I think you should really clarify the interaction between the new additions here. Having some sample code would be really nice too.
For Mesa at least (and I believe for the other drivers too) we always have a few exported buffers in every execbuf call, and we rely on the implicit synchronization provided by execbuf to make sure everything works. The execbuf ioctl also has some code to flush caches during implicit synchronization AFAIR, so I would guess we rely on it too and whatever else the Kernel does. Is that covered by the new ioctls?
In addition, as far as I remember, one of the big improvements of vm_bind was that it would help reduce ioctl latency and cpu overhead. But if making execbuf faster comes at the cost of requiring additional ioctls calls for implicit synchronization, which is required on ever execbuf call, then I wonder if we'll even get any faster at all. Comparing old execbuf vs plain new execbuf without the new required ioctls won't make sense.
But maybe I'm wrong and we won't need to call these new ioctls around every single execbuf ioctl we submit? Again, more clarification and some code examples here would be really nice. This is a big change on an important part of the API, we should clarify the new expected usage.
Thanks Paulo for the comments.
In VM_BIND mode, the only reason we would need execlist support in the execbuff path is for implicit synchronization. And AFAIK, this work from Jason is expected to replace implicit synchronization with new ioctls. Hence, VM_BIND mode will not need execlist support at all.
Based on comments from Daniel and my offline sync with Jason, this new mechanism from Jason is expected to work for vl. For gl, there is a question of whether it will be performant or not, but it is worth trying that first. If it is not performant for gl, only then can we consider adding implicit sync support back for VM_BIND mode.
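To illustrate what that would look like for a UMD (assuming the dma-buf sync-file ioctls land roughly as proposed in Jason's series; DMA_BUF_IOCTL_EXPORT_SYNC_FILE / DMA_BUF_IOCTL_IMPORT_SYNC_FILE, their structs and flags below are taken from that in-flight work and are assumptions here), the per-shared-BO pattern would be roughly:

/*
 * Sketch only: bridging implicit sync on a shared dma-buf to the explicit
 * in/out fences of a VM_BIND mode execbuf.
 */
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>
#include <drm/i915_drm.h>

static int submit_with_shared_bo(int drm_fd, int dmabuf_fd,
                                 struct drm_i915_gem_execbuffer2 *execbuf)
{
    /* 1) Pull the BO's current implicit fences out as a sync_file fd. */
    struct dma_buf_export_sync_file export = { .flags = DMA_BUF_SYNC_RW };
    if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &export))
        return -1;

    /* 2) Use it as the execbuf in-fence and request an out-fence. */
    execbuf->flags |= I915_EXEC_FENCE_IN | I915_EXEC_FENCE_OUT;
    execbuf->rsvd2 = (uint32_t)export.fd;          /* lower 32 bits: in */
    if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2_WR, execbuf)) {
        close(export.fd);
        return -1;
    }
    close(export.fd);

    /* 3) Feed the out-fence back into the BO so other implicit-sync users
     * (e.g. a compositor importing this dma-buf) still wait correctly.
     */
    struct dma_buf_import_sync_file import = {
        .flags = DMA_BUF_SYNC_WRITE,
        .fd    = (int)(execbuf->rsvd2 >> 32),      /* upper 32 bits: out */
    };
    int ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &import);
    close(import.fd);
    return ret;
}

How many of these per-BO round trips a typical frame needs is exactly the cost question Paulo raises above.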
Daniel, Jason, Ken, any thoughts you can add here?
+This also means, we need an execbuff extension to pass in the batch +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+If at all execlist support in execbuff ioctl is deemed necessary for +implicit sync in certain use cases, then support can be added later.
IMHO we really need to sort this and check all the assumptions before we commit to any interface. Again, implicit synchronization is something we rely on during *every* execbuf ioctl for most workloads.
Daniel's earlier feedback was that it is worth Mesa trying this new mechanism for gl and seeing if it works. We want to avoid adding execlist support for implicit sync in vm_bind mode from the beginning if it is going to be deemed unnecessary.
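For completeness, a sketch of what a VM_BIND mode submission without an execlist might look like is below. struct drm_i915_gem_execbuffer_ext_batch_addresses, its field names and the extension define are from the RFC header and should be read as assumptions; only I915_EXEC_USE_EXTENSIONS and the execbuffer2 ioctl itself are existing uapi.

/*
 * Sketch only: VM_BIND mode execbuf with no execlist; the batch GPU VA is
 * passed via the RFC batch-address extension (names/fields not final).
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>
#include "i915_vm_bind.h"            /* RFC uapi draft */

static int submit_vm_bind_mode(int drm_fd, uint32_t ctx_id, uint64_t batch_va)
{
    uint64_t batch_addresses[] = { batch_va };

    struct drm_i915_gem_execbuffer_ext_batch_addresses batch_ext = {
        .base.name     = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES,
        .count         = 1,
        .batch_address = (uintptr_t)batch_addresses,
    };

    struct drm_i915_gem_execbuffer2 execbuf = {
        .buffers_ptr   = 0,
        .buffer_count  = 0,          /* no execlist in VM_BIND mode */
        .rsvd1         = ctx_id,
        .flags         = I915_EXEC_USE_EXTENSIONS,
        .cliprects_ptr = (uintptr_t)&batch_ext,   /* extension chain */
    };

    return ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}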
+In VM_BIND mode, VA allocation is completely managed by the user instead of +the i915 driver. Hence all VA assignment, eviction are not applicable in +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not +be using the i915_vma active reference tracking. It will instead use dma-resv +object for that (See `VM_BIND dma_resv usage`_).
+So, a lot of existing code in the execbuff path like relocations, VA evictions, +vma lookup table, implicit sync, vma active reference tracking etc., are not +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up +by clearly separating out the functionalities where the VM_BIND mode differs +from older method and they should be moved to separate files.
I seem to recall some conversations where we were told a bunch of ioctls would stop working or make no sense to call when using vm_bind. Can we please get a complete list of those? Bonus points if the Kernel starts telling us we just called something that makes no sense.
Which ioctls are you talking about here? We do not support the GEM_WAIT ioctl, but that is only for compute mode (which is already documented in this patch).
+VM_PRIVATE objects +------------------- +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO which +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on +the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each execbuff +submission, they need only one dma-resv fence list updated. Thus, the fast +path (where required mappings are already bound) submission latency is O(1) +w.r.t the number of VM private BOs.
I know we already discussed this, but just to document it publicly: the ideal case for user space would be that every BO is created as private but then we'd have an ioctl to convert it to non-private (without the need to have a non-private->private interface).
An explanation on why we can't have an ioctl to mark as exported a buffer that was previously vm_private would be really appreciated.
Ok, I can add some notes on that. The reason is that this would require changing the dma-resv object of the gem object, and hence the object locking as well. This will add complications, as we would have to sync with any pending operations. It might be easier for UMDs to do it themselves by copying the object contents to a new object.
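For illustration, the creation-time choice looks roughly like the sketch below (the VM-private extension struct and name are from the RFC header and may change; a shareable BO simply omits the extension).

/*
 * Sketch only: create a BO that is private to one VM. The VM-private
 * extension below is from the RFC header, not final uapi. A BO created
 * without this extension remains a Shared BO and can be dma-buf exported.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>
#include "i915_vm_bind.h"            /* RFC uapi draft */

static int create_vm_private_bo(int drm_fd, uint32_t vm_id, uint64_t size,
                                uint32_t *handle)
{
    struct drm_i915_gem_create_ext_vm_private vm_private = {
        .base.name = I915_GEM_CREATE_EXT_VM_PRIVATE,   /* RFC extension */
        .vm_id     = vm_id,   /* mappable only on this VM, not exportable */
    };
    struct drm_i915_gem_create_ext create = {
        .size       = size,
        .extensions = (uintptr_t)&vm_private,
    };
    int ret = ioctl(drm_fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);

    *handle = create.handle;
    return ret;
}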
Niranjana
On Mon, May 23, 2022 at 12:05:05PM -0700, Niranjana Vishwanathapura wrote:
On Thu, May 19, 2022 at 03:52:01PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
+Execbuff ioctl in VM_BIND mode +------------------------------- +The execbuff ioctl handling in VM_BIND mode differs significantly from the +older method. A VM in VM_BIND mode will not support older execbuff mode of +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence, +no support for implicit sync. It is expected that the below work will be able +to support requirements of object dependency setting in all use cases:
+"dma-buf: Add an API for exporting sync files" +(https://lwn.net/Articles/859290/)
I would really like to have more details here. The link provided points to new ioctls and we're not very familiar with those yet, so I think you should really clarify the interaction between the new additions here. Having some sample code would be really nice too.
For Mesa at least (and I believe for the other drivers too) we always have a few exported buffers in every execbuf call, and we rely on the implicit synchronization provided by execbuf to make sure everything works. The execbuf ioctl also has some code to flush caches during implicit synchronization AFAIR, so I would guess we rely on it too and whatever else the Kernel does. Is that covered by the new ioctls?
In addition, as far as I remember, one of the big improvements of vm_bind was that it would help reduce ioctl latency and cpu overhead. But if making execbuf faster comes at the cost of requiring additional ioctl calls for implicit synchronization, which is required on every execbuf call, then I wonder if we'll even get any faster at all. Comparing old execbuf vs plain new execbuf without the new required ioctls won't make sense.
But maybe I'm wrong and we won't need to call these new ioctls around every single execbuf ioctl we submit? Again, more clarification and some code examples here would be really nice. This is a big change on an important part of the API, we should clarify the new expected usage.
Thanks Paulo for the comments.
In VM_BIND mode, the only reason we would need execlist support in the execbuff path is for implicit synchronization. And AFAIK, this work from Jason is expected to replace implicit synchronization with new ioctls. Hence, VM_BIND mode will not be needing execlist support at all.
Based on comments from Daniel and my offline sync with Jason, this new mechanism from Jason is expected to work for vl. For gl, there is a question of whether it will be performant or not. But it is worth trying that first. If it is not performant for gl, only then can we consider adding implicit sync support back for VM_BIND mode.
Daniel, Jason, Ken, any thoughts you can add here?
CC'ing Ken.
+If at all execlist support in execbuff ioctl is deemed necessary for +implicit sync in certain use cases, then support can be added later.
IMHO we really need to sort this and check all the assumptions before we commit to any interface. Again, implicit synchronization is something we rely on during *every* execbuf ioctl for most workloads.
Daniel's earlier feedback was that it is worth Mesa trying this new mechanism for gl and seeing if that works. We want to avoid adding execlist support for implicit sync in vm_bind mode from the beginning if it is going to be deemed unnecessary.
+In VM_BIND mode, VA allocation is completely managed by the user instead of +the i915 driver. Hence all VA assignment, eviction are not applicable in +VM_BIND mode. Also, for determining object activeness, VM_BIND mode will not +be using the i915_vma active reference tracking. It will instead use dma-resv +object for that (See `VM_BIND dma_resv usage`_).
+So, a lot of existing code in the execbuff path like relocations, VA evictions, +vma lookup table, implicit sync, vma active reference tracking etc., are not +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned up +by clearly separating out the functionalities where the VM_BIND mode differs +from older method and they should be moved to separate files.
I seem to recall some conversations where we were told a bunch of ioctls would stop working or make no sense to call when using vm_bind. Can we please get a complete list of those? Bonus points if the Kernel starts telling us we just called something that makes no sense.
Which ioctls are you talking about here? We do not support the GEM_WAIT ioctl, but that is only for compute mode (which is already documented in this patch).
+VM_PRIVATE objects +------------------- +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO which +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on +the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each execbuff +submission, they need only one dma-resv fence list updated. Thus, the fast +path (where required mappings are already bound) submission latency is O(1) +w.r.t the number of VM private BOs.
I know we already discussed this, but just to document it publicly: the ideal case for user space would be that every BO is created as private but then we'd have an ioctl to convert it to non-private (without the need to have a non-private->private interface).
An explanation on why we can't have an ioctl to mark as exported a buffer that was previously vm_private would be really appreciated.
Ok, I can add some notes on that. The reason is that this requires changing the dma-resv object of the gem object, and hence the object locking as well. This will add complications as we have to sync with any pending operations. It might be easier for UMDs to do it themselves by copying the object contents to a new object.
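For reference, creating a VM private BO with the proposed I915_GEM_CREATE_EXT_VM_PRIVATE extension could look roughly like the sketch below. Only drm_i915_gem_create_ext and i915_user_extension are existing i915_drm.h definitions; the extension struct layout and its name value here are illustrative guesses at the proposal in i915_vm_bind.h, not merged uapi.

/* Sketch only: I915_GEM_CREATE_EXT_VM_PRIVATE is part of this proposal, not
 * merged uapi; the extension layout and name value below are illustrative. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

struct create_ext_vm_private_sketch {
        struct i915_user_extension base;  /* .name = proposed VM_PRIVATE ext */
        uint32_t vm_id;                   /* VM this BO will be private to */
        uint32_t pad;
};

static int create_vm_private_bo(int drm_fd, uint32_t vm_id, uint64_t size,
                                uint32_t *handle)
{
        struct create_ext_vm_private_sketch ext = {
                .base.name = 2, /* hypothetical I915_GEM_CREATE_EXT_VM_PRIVATE */
                .vm_id = vm_id,
        };
        struct drm_i915_gem_create_ext create = {
                .size = size,
                .extensions = (uintptr_t)&ext,
        };

        if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create))
                return -1;

        /* The returned handle can only ever be vm_bound on vm_id and cannot be
         * dma-buf exported, which is what lets it share the VM's dma-resv. */
        *handle = create.handle;
        return 0;
}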
Niranjana
Thanks, Paulo
On 20/05/2022 01:52, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
+Execbuff ioctl in VM_BIND mode +------------------------------- +The execbuff ioctl handling in VM_BIND mode differs significantly from the +older method. A VM in VM_BIND mode will not support older execbuff mode of +binding. In VM_BIND mode, execbuff ioctl will not accept any execlist. Hence, +no support for implicit sync. It is expected that the below work will be able +to support requirements of object dependency setting in all use cases:
+"dma-buf: Add an API for exporting sync files" +(https://lwn.net/Articles/859290/)
I would really like to have more details here. The link provided points to new ioctls and we're not very familiar with those yet, so I think you should really clarify the interaction between the new additions here. Having some sample code would be really nice too.
For Mesa at least (and I believe for the other drivers too) we always have a few exported buffers in every execbuf call, and we rely on the implicit synchronization provided by execbuf to make sure everything works. The execbuf ioctl also has some code to flush caches during implicit synchronization AFAIR, so I would guess we rely on it too and whatever else the Kernel does. Is that covered by the new ioctls?
In addition, as far as I remember, one of the big improvements of vm_bind was that it would help reduce ioctl latency and cpu overhead. But if making execbuf faster comes at the cost of requiring additional ioctl calls for implicit synchronization, which is required on every execbuf call, then I wonder if we'll even get any faster at all. Comparing old execbuf vs plain new execbuf without the new required ioctls won't make sense. But maybe I'm wrong and we won't need to call these new ioctls around every single execbuf ioctl we submit? Again, more clarification and some code examples here would be really nice. This is a big change on an important part of the API, we should clarify the new expected usage.
Hey Paulo,
I think in the case of X11/Wayland, we'll be doing 1 or 2 extra ioctls per frame which seems pretty reasonable.
Essentially we need to set the dependencies on the buffer we're going to tell the display engine (gnome-shell/kde/bare-display-hw) to use.
In the Vulkan case, we're trading building execbuffer lists of potentially thousands of buffers for every single submission versus 1 or 2 ioctls for a single item when doing vkQueuePresent() (which happens less often than we do execbuffer ioctls).
That seems like a good trade off and doesn't look like a lot more work than explicit fencing where we would have to send associated fences.
Here is the Mesa MR associated with this : https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
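To make the "1 or 2 extra ioctls per frame" concrete, here is a rough sketch assuming the export/import sync-file ioctls from the linked series land more or less as proposed; the struct and ioctl names below come from that proposal, not from anything merged today.

/* Rough sketch only: assumes the proposed DMA_BUF_IOCTL_EXPORT/IMPORT_SYNC_FILE
 * uapi from the "dma-buf: Add an API for exporting sync files" series. */
#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* After the last execbuf of the frame: turn the shared buffer's fences into a
 * sync file and hand it to the compositor/display along with the buffer. */
static int export_frame_fence(int dmabuf_fd)
{
        struct dma_buf_export_sync_file args = {
                .flags = DMA_BUF_SYNC_RW,
                .fd = -1,
        };

        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args))
                return -1;
        return args.fd;
}

/* The other direction: attach a fence we got from the compositor so that
 * implicit-sync consumers of the buffer wait for it. */
static int import_acquire_fence(int dmabuf_fd, int sync_file_fd)
{
        struct dma_buf_import_sync_file args = {
                .flags = DMA_BUF_SYNC_WRITE,
                .fd = sync_file_fd,
        };

        return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
}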
-Lionel
On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
And the fact that it's happening in an async worker seems to imply it's not immediate.
I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do:
VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)
In what order are the fences going to be signaled?
In the order of VM_BIND ioctls? Or out of order?
Because you wrote "serialized", I assume it's: in order.
One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification.
In Vulkan VM_BIND operations are serialized but per engine.
So you could have something like this :
VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
fence1 is not signaled
fence3 is signaled
So the second VM_BIND will proceed before the first VM_BIND.
I guess we can deal with that scenario in userspace by doing the wait ourselves in one thread per engine.
But then it makes the VM_BIND input fences useless.
Daniel: what do you think? Should we rework this or just deal with wait fences in userspace?
Sorry I noticed this late.
-Lionel
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
Daniel: what do you think? Should we rework this or just deal with wait fences in userspace?
My opinion is to rework this but make the ordering via an engine param optional.
e.g. A VM can be configured so all binds are ordered within the VM
e.g. A VM can be configured so all binds accept an engine argument (in the case of the i915 likely this is a gem context handle) and binds ordered with respect to that engine.
This gives UMDs options, as the latter likely consumes more KMD resources, so if a different UMD can live with binds being ordered within the VM they can use a mode consuming less resources.
Matt
On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
And the fact that it's happening in an async worker seem to imply it's not immediate.
Ok, will fix. This was added because in the earlier design binding was deferred until the next execbuff. But now it is non-deferred (immediate in that sense). But yeah, this is confusing and I will fix it.
I have a question on the behavior of the bind operation when no input fence is provided. Let say I do :
VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)
In what order are the fences going to be signaled?
In the order of VM_BIND ioctls? Or out of order?
Because you wrote "serialized", I assume it's: in order.
Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will use the same queue and hence are ordered.
My opinion is rework this but make the ordering via an engine param optional.
e.g. A VM can be configured so all binds are ordered within the VM
e.g. A VM can be configured so all binds accept an engine argument (in the case of the i915 likely this is a gem context handle) and binds ordered with respect to that engine.
This gives UMDs options as the later likely consumes more KMD resources so if a different UMD can live with binds being ordered within the VM they can use a mode consuming less resources.
I think we need to be careful here if we are looking for some out of (submission) order completion of vm_bind/unbind. In-order completion means, in a batch of binds and unbinds to be completed in-order, the user only needs to specify an in-fence for the first bind/unbind call and the out-fence for the last bind/unbind call. Also, the VA released by an unbind call can be re-used by any subsequent bind call in that in-order batch.
These things will break if binding/unbinding were allowed to go out of order (of submission), and the user would need to be extra careful not to run into premature triggering of the out-fence, binds failing because the VA is still in use, etc.
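To make the in-order contract concrete, here is a small sketch; vm_bind()/vm_unbind() are hypothetical userspace wrappers around the proposed ioctls (not real uapi), with the proposed in/out fences as the last two arguments (0 meaning none).

/* Illustrative only: vm_bind()/vm_unbind() are hypothetical wrappers around
 * the proposed VM_BIND/UNBIND ioctls. */
#include <stdint.h>

extern void vm_bind(int vm, int bo, uint64_t va, uint32_t in_fence, uint32_t out_fence);
extern void vm_unbind(int vm, uint64_t va, uint32_t in_fence, uint32_t out_fence);

static void bind_batch_in_order(int vm, int bo1, int bo2, int bo3,
                                uint64_t va1, uint64_t va2, uint64_t old_va,
                                uint32_t job_done_fence, uint32_t all_bound_fence)
{
        vm_bind(vm, bo1, va1, job_done_fence, 0);     /* only the first op needs an in-fence */
        vm_bind(vm, bo2, va2, 0, 0);
        vm_unbind(vm, old_va, 0, 0);                  /* releases old_va */
        vm_bind(vm, bo3, old_va, 0, all_bound_fence); /* safely re-uses old_va */
        /*
         * Completion is in submission order, so all_bound_fence signaling
         * implies all four operations above have completed.
         */
}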
Also, VM_BIND binds the provided mapping on the specified address space (VM). So, the uapi is not engine/context specific.
We can however add a 'queue' to the uapi which can be one from the pre-defined queues, I915_VM_BIND_QUEUE_0 I915_VM_BIND_QUEUE_1 ... I915_VM_BIND_QUEUE_(N-1)
KMD will spawn an async work queue for each queue, which will only bind the mappings on that queue in the order of submission. The user can assign a queue per engine or anything like that.
But again, here the user needs to be careful not to deadlock these queues with circular dependencies of fences.
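Purely to illustrate that proposal (the queue argument and the I915_VM_BIND_QUEUE_* names are not uapi today, and the wrapper below is hypothetical), using Lionel's rcs0/ccs0 example:

/* Hypothetical sketch of the proposed per-VM bind queues. */
#include <stdint.h>

extern void vm_bind_on_queue(int vm, uint32_t queue, int bo, uint64_t va,
                             uint32_t in_fence, uint32_t out_fence);

#define BIND_QUEUE_RCS0 0  /* e.g. I915_VM_BIND_QUEUE_0, assignment chosen by the UMD */
#define BIND_QUEUE_CCS0 1  /* e.g. I915_VM_BIND_QUEUE_1 */

static void per_engine_binds(int vm, int bo1, int bo2, uint64_t va1, uint64_t va2,
                             uint32_t fence1, uint32_t fence2,
                             uint32_t fence3, uint32_t fence4)
{
        /* Binds are ordered only within their queue, so if fence3 signals
         * first, the ccs0-queue bind can complete even though the rcs0-queue
         * bind is still waiting on fence1. */
        vm_bind_on_queue(vm, BIND_QUEUE_RCS0, bo1, va1, fence1, fence2);
        vm_bind_on_queue(vm, BIND_QUEUE_CCS0, bo2, va2, fence3, fence4);
}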
I prefer adding this later as an extension, based on whether it really helps with the implementation.
Daniel, any thoughts?
Niranjana
On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura < niranjana.vishwanathapura@intel.com> wrote:
I can tell you right now that having everything on a single in-order queue will not get us the perf we want. What Vulkan really wants is one of two things:
1. No implicit ordering of VM_BIND ops. They just happen in whatever order their dependencies are resolved and we ensure ordering ourselves by having a syncobj in the VkQueue.
2. The ability to create multiple VM_BIND queues. We need at least 2 but I don't see why there needs to be a limit besides the limits the i915 API already has on the number of engines. Vulkan could expose multiple sparse binding queues to the client if it's not arbitrarily limited.
Why? Because Vulkan has two basic kinds of bind operations and we don't want any dependencies between them:
1. Immediate. These happen right after BO creation or maybe as part of vkBindImageMemory() or vkBindBufferMemory(). These don't happen on a queue and we don't want them serialized with anything. To synchronize with submit, we'll have a syncobj in the VkDevice which is signaled by all immediate bind operations and make submits wait on it.
2. Queued (sparse): These happen on a VkQueue which may be the same as a render/compute queue or may be its own queue. It's up to us what we want to advertise. From the Vulkan API PoV, this is like any other queue. Operations on it wait on and signal semaphores. If we have a VM_BIND engine, we'd provide syncobjs to wait and signal just like we do in execbuf().
The important thing is that we don't want one type of operation to block on the other. If immediate binds are blocking on sparse binds, it's going to cause over-synchronization issues.
In terms of the internal implementation, I know that there's going to be a lock on the VM and that we can't actually do these things in parallel. That's fine. Once the dma_fences have signaled and we're unblocked to do the bind operation, I don't care if there's a bit of synchronization due to locking. That's expected. What we can't afford to have is an immediate bind operation suddenly blocking on a sparse operation which is blocked on a compute job that's going to run for another 5ms.
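A minimal sketch of the "immediate" path from point 1 above, assuming a hypothetical wrapper around the proposed vm_bind out-fence plus the existing execbuf timeline-fence extension (all helper names are illustrative, not Mesa code):

#include <stdint.h>

extern void vm_bind_signal_timeline(int vm, int bo, uint64_t va,
                                    uint32_t syncobj, uint64_t point);
extern void execbuf_wait_timeline(int ctx, int batch_bo,
                                  uint32_t syncobj, uint64_t point);

static uint32_t dev_bind_syncobj;  /* one timeline syncobj per VkDevice */
static uint64_t dev_bind_point;    /* last point signaled by an immediate bind */

/* vkBindImageMemory()/vkBindBufferMemory(): bind right away and signal the
 * device timeline at a new point; nothing is waited on here, so this never
 * stalls behind sparse-queue work or a long compute job. */
static void immediate_bind(int vm, int bo, uint64_t va)
{
        vm_bind_signal_timeline(vm, bo, va, dev_bind_syncobj, ++dev_bind_point);
}

/* vkQueueSubmit(): make the GPU job wait for every immediate bind issued so
 * far by waiting on the latest device timeline point. */
static void submit_batch(int ctx, int batch_bo)
{
        execbuf_wait_timeline(ctx, batch_bo, dev_bind_syncobj, dev_bind_point);
}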
For reference, Windows solves this by allowing arbitrarily many paging queues (what they call a VM_BIND engine/queue). That design works pretty well and solves the problems in question. Again, we could just make everything out-of-order and require using syncobjs to order things as userspace wants. That'd be fine too.
One more note while I'm here: danvet said something on IRC about VM_BIND queues waiting for syncobjs to materialize. We don't really want/need this. We already have all the machinery in userspace to handle wait-before-signal and waiting for syncobj fences to materialize and that machinery is on by default. It would actually take MORE work in Mesa to turn it off and take advantage of the kernel being able to wait for syncobjs to materialize. Also, getting that right is ridiculously hard and I really don't want to get it wrong in kernel space. When we do memory fences, wait-before-signal will be a thing. We don't need to try and make it a thing for syncobj.
--Jason
On 02/06/2022 23:35, Jason Ekstrand wrote:
On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote: >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote: >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an >> > +async worker. The binding and unbinding will work like a special GPU engine. >> > +The binding and unbinding operations are serialized and will wait on specified >> > +input fences before the operation and will signal the output fences upon the >> > +completion of the operation. Due to serialization, completion of an operation >> > +will also indicate that all previous operations are also complete. >> >> I guess we should avoid saying "will immediately start binding/unbinding" if >> there are fences involved. >> >> And the fact that it's happening in an async worker seem to imply it's not >> immediate. >> Ok, will fix. This was added because in earlier design binding was deferred until next execbuff. But now it is non-deferred (immediate in that sense). But yah, this is confusing and will fix it. >> >> I have a question on the behavior of the bind operation when no input fence >> is provided. Let say I do : >> >> VM_BIND (out_fence=fence1) >> >> VM_BIND (out_fence=fence2) >> >> VM_BIND (out_fence=fence3) >> >> >> In what order are the fences going to be signaled? >> >> In the order of VM_BIND ioctls? Or out of order? >> >> Because you wrote "serialized I assume it's : in order >> Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will use the same queue and hence are ordered. >> >> One thing I didn't realize is that because we only get one "VM_BIND" engine, >> there is a disconnect from the Vulkan specification. >> >> In Vulkan VM_BIND operations are serialized but per engine. >> >> So you could have something like this : >> >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) >> >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) >> >> >> fence1 is not signaled >> >> fence3 is signaled >> >> So the second VM_BIND will proceed before the first VM_BIND. >> >> >> I guess we can deal with that scenario in userspace by doing the wait >> ourselves in one thread per engines. >> >> But then it makes the VM_BIND input fences useless. >> >> >> Daniel : what do you think? Should be rework this or just deal with wait >> fences in userspace? >> > >My opinion is rework this but make the ordering via an engine param optional. > >e.g. A VM can be configured so all binds are ordered within the VM > >e.g. A VM can be configured so all binds accept an engine argument (in >the case of the i915 likely this is a gem context handle) and binds >ordered with respect to that engine. > >This gives UMDs options as the later likely consumes more KMD resources >so if a different UMD can live with binds being ordered within the VM >they can use a mode consuming less resources. > I think we need to be careful here if we are looking for some out of (submission) order completion of vm_bind/unbind. In-order completion means, in a batch of binds and unbinds to be completed in-order, user only needs to specify in-fence for the first bind/unbind call and the our-fence for the last bind/unbind call. Also, the VA released by an unbind call can be re-used by any subsequent bind call in that in-order batch. 
These things will break if binding/unbinding were allowed to go out of (submission) order, and the user would need to be extra careful not to run into premature triggering of the out-fence, binds failing because the VA is still in use, etc. Also, VM_BIND binds the provided mapping on the specified address space (VM), so the uapi is not engine/context specific. We can however add a 'queue' to the uapi, which can be one of the pre-defined queues:
I915_VM_BIND_QUEUE_0
I915_VM_BIND_QUEUE_1
...
I915_VM_BIND_QUEUE_(N-1)
The KMD will spawn an async work queue for each queue, which will only bind the mappings on that queue in the order of submission. The user can assign a queue per engine or anything like that. But again, the user needs to be careful not to deadlock these queues with a circular dependency of fences. I prefer adding this later as an extension, based on whether it really helps with the implementation.
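For illustration, the pre-defined queue idea above could look roughly like the sketch below; the names are hypothetical and only show the shape of the proposal, not actual uapi:

/*
 * Hypothetical sketch of the pre-defined VM_BIND queue indices proposed
 * above.  Binds/unbinds submitted with the same queue index would be
 * executed by one ordered async worker in submission order; different
 * queue indices would be independent of each other.
 */
#define I915_VM_BIND_QUEUE_0		0
#define I915_VM_BIND_QUEUE_1		1
/* ...					*/
/* I915_VM_BIND_QUEUE_(N-1)		*/

Each vm_bind/unbind ioctl would then carry one of these indices in a __u32 queue field (see the struct sketch later in this thread).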
I can tell you right now that having everything on a single in-order queue will not get us the perf we want. What vulkan really wants is one of two things:
1. No implicit ordering of VM_BIND ops. They just happen in whatever their dependencies are resolved and we ensure ordering ourselves by having a syncobj in the VkQueue.
2. The ability to create multiple VM_BIND queues. We need at least 2 but I don't see why there needs to be a limit besides the limits the i915 API already has on the number of engines. Vulkan could expose multiple sparse binding queues to the client if it's not arbitrarily limited.
Why? Because Vulkan has two basic kind of bind operations and we don't want any dependencies between them:
1. Immediate. These happen right after BO creation or maybe as part of vkBindImageMemory() or VkBindBufferMemory(). These don't happen on a queue and we don't want them serialized with anything. To synchronize with submit, we'll have a syncobj in the VkDevice which is signaled by all immediate bind operations and make submits wait on it.
2. Queued (sparse): These happen on a VkQueue which may be the same as a render/compute queue or may be its own queue. It's up to us what we want to advertise. From the Vulkan API PoV, this is like any other queue. Operations on it wait on and signal semaphores. If we have a VM_BIND engine, we'd provide syncobjs to wait and signal just like we do in execbuf().
The important thing is that we don't want one type of operation to block on the other. If immediate binds are blocking on sparse binds, it's going to cause over-synchronization issues.
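As a rough userspace-side illustration of point 1 above (a sketch only: drmSyncobjCreate() is existing libdrm API, everything else here is a made-up placeholder, since the final vm_bind ioctl shape is still being discussed):

#include <stdint.h>
#include <xf86drm.h>

/*
 * Sketch: one device-wide syncobj that every "immediate" bind signals
 * and every execbuf waits on, so GPU submissions never race ahead of
 * the bind operations they depend on.
 */
struct device_sketch {
	int fd;				/* DRM device fd */
	uint32_t bind_ready_syncobj;	/* signaled by all immediate binds */
};

static int init_bind_sync(struct device_sketch *dev)
{
	return drmSyncobjCreate(dev->fd, 0, &dev->bind_ready_syncobj);
}

/*
 * Each immediate vm_bind would pass bind_ready_syncobj as its out-fence,
 * and each execbuf would list the same syncobj as a wait-fence.
 */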
In terms of the internal implementation, I know that there's going to be a lock on the VM and that we can't actually do these things in parallel. That's fine. Once the dma_fences have signaled and we're unblocked to do the bind operation, I don't care if there's a bit of synchronization due to locking. That's expected. What we can't afford to have is an immediate bind operation suddenly blocking on a sparse operation which is blocked on a compute job that's going to run for another 5ms.
For reference, Windows solves this by allowing arbitrarily many paging queues (what they call a VM_BIND engine/queue). That design works pretty well and solves the problems in question. Again, we could just make everything out-of-order and require using syncobjs to order things as userspace wants. That'd be fine too.
One more note while I'm here: danvet said something on IRC about VM_BIND queues waiting for syncobjs to materialize. We don't really want/need this. We already have all the machinery in userspace to handle wait-before-signal and waiting for syncobj fences to materialize and that machinery is on by default. It would actually take MORE work in Mesa to turn it off and take advantage of the kernel being able to wait for syncobjs to materialize. Also, getting that right is ridiculously hard and I really don't want to get it wrong in kernel space. When we do memory fences, wait-before-signal will be a thing. We don't need to try and make it a thing for syncobj.
--Jason
Thanks Jason,
I missed the bit in the Vulkan spec that we're allowed to have a sparse queue that does not implement either graphics or compute operations :
"While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT support in queue families that also include
graphics and compute support, other implementations may only expose a VK_QUEUE_SPARSE_BINDING_BIT-only queue
family."
So it can all be a vm_bind engine that just does bind/unbind operations.
But yes we need another engine for the immediate/non-sparse operations.
-Lionel
On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
On 02/06/2022 23:35, Jason Ekstrand wrote:
On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> wrote: On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote: >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote: >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an >> > +async worker. The binding and unbinding will work like a special GPU engine. >> > +The binding and unbinding operations are serialized and will wait on specified >> > +input fences before the operation and will signal the output fences upon the >> > +completion of the operation. Due to serialization, completion of an operation >> > +will also indicate that all previous operations are also complete. >> >> I guess we should avoid saying "will immediately start binding/unbinding" if >> there are fences involved. >> >> And the fact that it's happening in an async worker seem to imply it's not >> immediate. >> Ok, will fix. This was added because in earlier design binding was deferred until next execbuff. But now it is non-deferred (immediate in that sense). But yah, this is confusing and will fix it. >> >> I have a question on the behavior of the bind operation when no input fence >> is provided. Let say I do : >> >> VM_BIND (out_fence=fence1) >> >> VM_BIND (out_fence=fence2) >> >> VM_BIND (out_fence=fence3) >> >> >> In what order are the fences going to be signaled? >> >> In the order of VM_BIND ioctls? Or out of order? >> >> Because you wrote "serialized I assume it's : in order >> Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind will use the same queue and hence are ordered. >> >> One thing I didn't realize is that because we only get one "VM_BIND" engine, >> there is a disconnect from the Vulkan specification. >> >> In Vulkan VM_BIND operations are serialized but per engine. >> >> So you could have something like this : >> >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) >> >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) >> >> >> fence1 is not signaled >> >> fence3 is signaled >> >> So the second VM_BIND will proceed before the first VM_BIND. >> >> >> I guess we can deal with that scenario in userspace by doing the wait >> ourselves in one thread per engines. >> >> But then it makes the VM_BIND input fences useless. >> >> >> Daniel : what do you think? Should be rework this or just deal with wait >> fences in userspace? >> > >My opinion is rework this but make the ordering via an engine param optional. > >e.g. A VM can be configured so all binds are ordered within the VM > >e.g. A VM can be configured so all binds accept an engine argument (in >the case of the i915 likely this is a gem context handle) and binds >ordered with respect to that engine. > >This gives UMDs options as the later likely consumes more KMD resources >so if a different UMD can live with binds being ordered within the VM >they can use a mode consuming less resources. > I think we need to be careful here if we are looking for some out of (submission) order completion of vm_bind/unbind. In-order completion means, in a batch of binds and unbinds to be completed in-order, user only needs to specify in-fence for the first bind/unbind call and the our-fence for the last bind/unbind call. Also, the VA released by an unbind call can be re-used by any subsequent bind call in that in-order batch. 
These things will break if binding/unbinding were to be allowed to go out of order (of submission) and user need to be extra careful not to run into pre-mature triggereing of out-fence and bind failing as VA is still in use etc. Also, VM_BIND binds the provided mapping on the specified address space (VM). So, the uapi is not engine/context specific. We can however add a 'queue' to the uapi which can be one from the pre-defined queues, I915_VM_BIND_QUEUE_0 I915_VM_BIND_QUEUE_1 ... I915_VM_BIND_QUEUE_(N-1) KMD will spawn an async work queue for each queue which will only bind the mappings on that queue in the order of submission. User can assign the queue to per engine or anything like that. But again here, user need to be careful and not deadlock these queues with circular dependency of fences. I prefer adding this later an as extension based on whether it is really helping with the implementation. I can tell you right now that having everything on a single in-order queue will not get us the perf we want. What vulkan really wants is one of two things: 1. No implicit ordering of VM_BIND ops. They just happen in whatever their dependencies are resolved and we ensure ordering ourselves by having a syncobj in the VkQueue. 2. The ability to create multiple VM_BIND queues. We need at least 2 but I don't see why there needs to be a limit besides the limits the i915 API already has on the number of engines. Vulkan could expose multiple sparse binding queues to the client if it's not arbitrarily limited.
Thanks Jason, Lionel.
Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.
I am trying to see how many queues we need and don't want it to be arbitrarily large and unduly blow up memory usage and complexity in the i915 driver.
Why? Because Vulkan has two basic kind of bind operations and we don't want any dependencies between them: 1. Immediate. These happen right after BO creation or maybe as part of vkBindImageMemory() or VkBindBufferMemory(). These don't happen on a queue and we don't want them serialized with anything. To synchronize with submit, we'll have a syncobj in the VkDevice which is signaled by all immediate bind operations and make submits wait on it. 2. Queued (sparse): These happen on a VkQueue which may be the same as a render/compute queue or may be its own queue. It's up to us what we want to advertise. From the Vulkan API PoV, this is like any other queue. Operations on it wait on and signal semaphores. If we have a VM_BIND engine, we'd provide syncobjs to wait and signal just like we do in execbuf(). The important thing is that we don't want one type of operation to block on the other. If immediate binds are blocking on sparse binds, it's going to cause over-synchronization issues. In terms of the internal implementation, I know that there's going to be a lock on the VM and that we can't actually do these things in parallel. That's fine. Once the dma_fences have signaled and we're
That's correct. It is like a single VM_BIND engine with multiple queues feeding into it.
unblocked to do the bind operation, I don't care if there's a bit of synchronization due to locking. That's expected. What we can't afford to have is an immediate bind operation suddenly blocking on a sparse operation which is blocked on a compute job that's going to run for another 5ms.
As the VM_BIND queue is per VM, a VM_BIND on one VM doesn't block VM_BINDs on other VMs. I am not sure about the use cases here, but just wanted to clarify.
Niranjana
For reference, Windows solves this by allowing arbitrarily many paging queues (what they call a VM_BIND engine/queue). That design works pretty well and solves the problems in question. Again, we could just make everything out-of-order and require using syncobjs to order things as userspace wants. That'd be fine too. One more note while I'm here: danvet said something on IRC about VM_BIND queues waiting for syncobjs to materialize. We don't really want/need this. We already have all the machinery in userspace to handle wait-before-signal and waiting for syncobj fences to materialize and that machinery is on by default. It would actually take MORE work in Mesa to turn it off and take advantage of the kernel being able to wait for syncobjs to materialize. Also, getting that right is ridiculously hard and I really don't want to get it wrong in kernel space. When we do memory fences, wait-before-signal will be a thing. We don't need to try and make it a thing for syncobj. --Jason
Thanks Jason,
I missed the bit in the Vulkan spec that we're allowed to have a sparse queue that does not implement either graphics or compute operations :
"While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT support in queue families that also include graphics and compute support, other implementations may only expose a VK_QUEUE_SPARSE_BINDING_BIT-only queue family."
So it can all be all a vm_bind engine that just does bind/unbind operations.
But yes we need another engine for the immediate/non-sparse operations.
-Lionel
On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> wrote:
Thanks Jason, Lionel.
Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.
There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2. I think someone had an extended version that allowed more but I ripped it out because no one was using it. Of course, execbuffer3 might not have that problem at all.
I am trying to see how many queues we need and don't want it to be arbitrarily large and unduly blow up memory usage and complexity in the i915 driver.
I expect a Vulkan driver to use at most 2 in the vast majority of cases. I could imagine a client wanting to create more than 1 sparse queue in which case, it'll be N+1 but that's unlikely. As far as complexity goes, once you allow two, I don't think the complexity is going up by allowing N. As for memory usage, creating more queues means more memory. That's a trade-off that userspace can make. Again, the expected number here is 1 or 2 in the vast majority of cases so I don't think you need to worry.
Thats correct. It is like a single VM_BIND engine with multiple queues feeding to it.
Right. As long as the queues themselves are independent and can block on dma_fences without holding up other queues, I think we're fine.
As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND on other VMs. I am not sure about usecases here, but just wanted to clarify.
Yes, that's what I would expect.
--Jason
On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: > On 02/06/2022 23:35, Jason Ekstrand wrote: > > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura > <niranjana.vishwanathapura@intel.com> wrote: > > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote: > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote: > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: > >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding > the mapping in an > >> > +async worker. The binding and unbinding will work like a special > GPU engine. > >> > +The binding and unbinding operations are serialized and will > wait on specified > >> > +input fences before the operation and will signal the output > fences upon the > >> > +completion of the operation. Due to serialization, completion of > an operation > >> > +will also indicate that all previous operations are also > complete. > >> > >> I guess we should avoid saying "will immediately start > binding/unbinding" if > >> there are fences involved. > >> > >> And the fact that it's happening in an async worker seem to imply > it's not > >> immediate. > >> > > Ok, will fix. > This was added because in earlier design binding was deferred until > next execbuff. > But now it is non-deferred (immediate in that sense). But yah, this is > confusing > and will fix it. > > >> > >> I have a question on the behavior of the bind operation when no > input fence > >> is provided. Let say I do : > >> > >> VM_BIND (out_fence=fence1) > >> > >> VM_BIND (out_fence=fence2) > >> > >> VM_BIND (out_fence=fence3) > >> > >> > >> In what order are the fences going to be signaled? > >> > >> In the order of VM_BIND ioctls? Or out of order? > >> > >> Because you wrote "serialized I assume it's : in order > >> > > Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind > will use > the same queue and hence are ordered. > > >> > >> One thing I didn't realize is that because we only get one > "VM_BIND" engine, > >> there is a disconnect from the Vulkan specification. > >> > >> In Vulkan VM_BIND operations are serialized but per engine. > >> > >> So you could have something like this : > >> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) > >> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) > >> > >> > >> fence1 is not signaled > >> > >> fence3 is signaled > >> > >> So the second VM_BIND will proceed before the first VM_BIND. > >> > >> > >> I guess we can deal with that scenario in userspace by doing the > wait > >> ourselves in one thread per engines. > >> > >> But then it makes the VM_BIND input fences useless. > >> > >> > >> Daniel : what do you think? Should be rework this or just deal with > wait > >> fences in userspace? > >> > > > >My opinion is rework this but make the ordering via an engine param > optional. > > > >e.g. A VM can be configured so all binds are ordered within the VM > > > >e.g. A VM can be configured so all binds accept an engine argument > (in > >the case of the i915 likely this is a gem context handle) and binds > >ordered with respect to that engine. > > > >This gives UMDs options as the later likely consumes more KMD > resources > >so if a different UMD can live with binds being ordered within the VM > >they can use a mode consuming less resources. > > > > I think we need to be careful here if we are looking for some out of > (submission) order completion of vm_bind/unbind. 
> In-order completion means, in a batch of binds and unbinds to be > completed in-order, user only needs to specify in-fence for the > first bind/unbind call and the our-fence for the last bind/unbind > call. Also, the VA released by an unbind call can be re-used by > any subsequent bind call in that in-order batch. > > These things will break if binding/unbinding were to be allowed to > go out of order (of submission) and user need to be extra careful > not to run into pre-mature triggereing of out-fence and bind failing > as VA is still in use etc. > > Also, VM_BIND binds the provided mapping on the specified address > space > (VM). So, the uapi is not engine/context specific. > > We can however add a 'queue' to the uapi which can be one from the > pre-defined queues, > I915_VM_BIND_QUEUE_0 > I915_VM_BIND_QUEUE_1 > ... > I915_VM_BIND_QUEUE_(N-1) > > KMD will spawn an async work queue for each queue which will only > bind the mappings on that queue in the order of submission. > User can assign the queue to per engine or anything like that. > > But again here, user need to be careful and not deadlock these > queues with circular dependency of fences. > > I prefer adding this later an as extension based on whether it > is really helping with the implementation. > > I can tell you right now that having everything on a single in-order > queue will not get us the perf we want. What vulkan really wants is one > of two things: > 1. No implicit ordering of VM_BIND ops. They just happen in whatever > their dependencies are resolved and we ensure ordering ourselves by > having a syncobj in the VkQueue. > 2. The ability to create multiple VM_BIND queues. We need at least 2 > but I don't see why there needs to be a limit besides the limits the > i915 API already has on the number of engines. Vulkan could expose > multiple sparse binding queues to the client if it's not arbitrarily > limited. Thanks Jason, Lionel. Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.
There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2. I think someone had an extended version that allowed more but I ripped it out because no one was using it. Of course, execbuffer3 might not have that problem at all.
Thanks Jason. Ok, I am not sure which exec flag that is, but yeah, execbuffer3 probably will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE and somehow export it to the user (I am thinking of embedding it in I915_PARAM_HAS_VM_BIND: bits[0]->HAS_VM_BIND, bits[1:3]->'n', meaning 2^n queues).
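A sketch of how userspace might query and decode such an encoding; the bit layout is only the proposal above and the param value is a placeholder, not final uapi:

#include <sys/ioctl.h>
#include <drm/i915_drm.h>

#ifndef I915_PARAM_HAS_VM_BIND
#define I915_PARAM_HAS_VM_BIND 0	/* placeholder only; real value comes from the final uapi */
#endif

/*
 * Returns the number of VM_BIND queues, or 0 if VM_BIND is unsupported.
 * Assumes the proposed encoding: bit 0 = has_vm_bind, bits 1..3 = n,
 * with 2^n queues.
 */
static unsigned int query_vm_bind_queues(int fd)
{
	int value = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_VM_BIND,
		.value = &value,
	};

	if (ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp))
		return 0;
	if (!(value & 0x1))
		return 0;
	return 1u << ((value >> 1) & 0x7);
}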
I am trying to see how many queues we need and don't want it to be arbitrarily large and unduely blow up memory usage and complexity in i915 driver.
I expect a Vulkan driver to use at most 2 in the vast majority of cases. I could imagine a client wanting to create more than 1 sparse queue in which case, it'll be N+1 but that's unlikely. As far as complexity goes, once you allow two, I don't think the complexity is going up by allowing N. As for memory usage, creating more queues means more memory. That's a trade-off that userspace can make. Again, the expected number here is 1 or 2 in the vast majority of cases so I don't think you need to worry.
Ok, will start with n=3, meaning 8 queues. That would require us to create 8 workqueues. We can change 'n' later if required.
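Roughly, on the kernel side "one ordered async workqueue per bind queue" could be sketched as below (illustrative only, with made-up names; not the actual i915 implementation):

#include <linux/errno.h>
#include <linux/workqueue.h>

#define VM_BIND_NUM_QUEUES	8	/* n=3 => 2^3 queues, per the discussion above */

/*
 * Sketch: one ordered workqueue per VM_BIND queue, so binds/unbinds on
 * the same queue complete strictly in submission order while different
 * queues can make progress independently of each other.
 */
struct sketch_vm_bind_queues {
	struct workqueue_struct *wq[VM_BIND_NUM_QUEUES];
};

static int sketch_vm_bind_queues_init(struct sketch_vm_bind_queues *q)
{
	int i;

	for (i = 0; i < VM_BIND_NUM_QUEUES; i++) {
		q->wq[i] = alloc_ordered_workqueue("vm_bind_q%d", 0, i);
		if (!q->wq[i]) {
			while (i--)
				destroy_workqueue(q->wq[i]);
			return -ENOMEM;
		}
	}
	return 0;
}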
Niranjana
> Why? Because Vulkan has two basic kind of bind operations and we don't > want any dependencies between them: > 1. Immediate. These happen right after BO creation or maybe as part of > vkBindImageMemory() or VkBindBufferMemory(). These don't happen on a > queue and we don't want them serialized with anything. To synchronize > with submit, we'll have a syncobj in the VkDevice which is signaled by > all immediate bind operations and make submits wait on it. > 2. Queued (sparse): These happen on a VkQueue which may be the same as > a render/compute queue or may be its own queue. It's up to us what we > want to advertise. From the Vulkan API PoV, this is like any other > queue. Operations on it wait on and signal semaphores. If we have a > VM_BIND engine, we'd provide syncobjs to wait and signal just like we do > in execbuf(). > The important thing is that we don't want one type of operation to block > on the other. If immediate binds are blocking on sparse binds, it's > going to cause over-synchronization issues. > In terms of the internal implementation, I know that there's going to be > a lock on the VM and that we can't actually do these things in > parallel. That's fine. Once the dma_fences have signaled and we're Thats correct. It is like a single VM_BIND engine with multiple queues feeding to it.
Right. As long as the queues themselves are independent and can block on dma_fences without holding up other queues, I think we're fine.
> unblocked to do the bind operation, I don't care if there's a bit of > synchronization due to locking. That's expected. What we can't afford > to have is an immediate bind operation suddenly blocking on a sparse > operation which is blocked on a compute job that's going to run for > another 5ms. As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND on other VMs. I am not sure about usecases here, but just wanted to clarify.
Yes, that's what I would expect. --Jason
Niranjana > For reference, Windows solves this by allowing arbitrarily many paging > queues (what they call a VM_BIND engine/queue). That design works > pretty well and solves the problems in question. Again, we could just > make everything out-of-order and require using syncobjs to order things > as userspace wants. That'd be fine too. > One more note while I'm here: danvet said something on IRC about VM_BIND > queues waiting for syncobjs to materialize. We don't really want/need > this. We already have all the machinery in userspace to handle > wait-before-signal and waiting for syncobj fences to materialize and > that machinery is on by default. It would actually take MORE work in > Mesa to turn it off and take advantage of the kernel being able to wait > for syncobjs to materialize. Also, getting that right is ridiculously > hard and I really don't want to get it wrong in kernel space. When we > do memory fences, wait-before-signal will be a thing. We don't need to > try and make it a thing for syncobj. > --Jason > > Thanks Jason, > > I missed the bit in the Vulkan spec that we're allowed to have a sparse > queue that does not implement either graphics or compute operations : > > "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT > support in queue families that also include > > graphics and compute support, other implementations may only expose a > VK_QUEUE_SPARSE_BINDING_BIT-only queue > > family." > > So it can all be all a vm_bind engine that just does bind/unbind > operations. > > But yes we need another engine for the immediate/non-sparse operations. > > -Lionel > > > > Daniel, any thoughts? > > Niranjana > > >Matt > > > >> > >> Sorry I noticed this late. > >> > >> > >> -Lionel > >> > >>
On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
Thanks Jason, Lionel.
Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.
There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2. I think someone had an extended version that allowed more but I ripped it out because no one was using it. Of course, execbuffer3 might not have that problem at all.
Thanks Jason. Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE and somehow export it to user (I am thinking of embedding it in I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n queues.
Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f), which execbuf3 will also have. So, we can simply define in the vm_bind/unbind structures:
#define I915_VM_BIND_MAX_QUEUE 64
__u32 queue;
I think that will keep things simple.
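For reference, the existing execbuf2 limit being discussed, together with a sketch of where such a queue field might sit in a bind structure (the struct below is illustrative only, not the RFC's final uapi):

#include <linux/types.h>

/*
 * Existing execbuf2 limit (include/uapi/drm/i915_drm.h):
 *	#define I915_EXEC_RING_MASK (0x3f)
 * i.e. at most 64 engines are addressable through the execbuf2 flags.
 */

/* Proposed above: carry the same limit over as the VM_BIND queue limit. */
#define I915_VM_BIND_MAX_QUEUE	64

/* Hypothetical placement of the queue index; field names and layout are
 * made up for illustration. */
struct sketch_i915_gem_vm_bind {
	__u32 vm_id;		/* target address space (VM) */
	__u32 queue;		/* 0 .. I915_VM_BIND_MAX_QUEUE - 1 */
	__u32 handle;		/* GEM BO handle */
	__u32 pad;
	__u64 start;		/* GPU virtual address of the mapping */
	__u64 offset;		/* offset into the BO */
	__u64 length;		/* length of the mapping */
	__u64 flags;
	__u64 extensions;	/* in/out fence extensions chained here */
};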
Niranjana
On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: > On 02/06/2022 23:35, Jason Ekstrand wrote: > > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura > niranjana.vishwanathapura@intel.com wrote: > > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote: > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote: > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: > >> > +VM_BIND/UNBIND ioctl will immediately start binding/unbinding > the mapping in an > >> > +async worker. The binding and unbinding will work like a special > GPU engine. > >> > +The binding and unbinding operations are serialized and will > wait on specified > >> > +input fences before the operation and will signal the output > fences upon the > >> > +completion of the operation. Due to serialization, completion of > an operation > >> > +will also indicate that all previous operations are also > complete. > >> > >> I guess we should avoid saying "will immediately start > binding/unbinding" if > >> there are fences involved. > >> > >> And the fact that it's happening in an async worker seem to imply > it's not > >> immediate. > >> > > Ok, will fix. > This was added because in earlier design binding was deferred until > next execbuff. > But now it is non-deferred (immediate in that sense). But yah, this is > confusing > and will fix it. > > >> > >> I have a question on the behavior of the bind operation when no > input fence > >> is provided. Let say I do : > >> > >> VM_BIND (out_fence=fence1) > >> > >> VM_BIND (out_fence=fence2) > >> > >> VM_BIND (out_fence=fence3) > >> > >> > >> In what order are the fences going to be signaled? > >> > >> In the order of VM_BIND ioctls? Or out of order? > >> > >> Because you wrote "serialized I assume it's : in order > >> > > Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind > will use > the same queue and hence are ordered. > > >> > >> One thing I didn't realize is that because we only get one > "VM_BIND" engine, > >> there is a disconnect from the Vulkan specification. > >> > >> In Vulkan VM_BIND operations are serialized but per engine. > >> > >> So you could have something like this : > >> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) > >> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) > >> > >> > >> fence1 is not signaled > >> > >> fence3 is signaled > >> > >> So the second VM_BIND will proceed before the first VM_BIND. > >> > >> > >> I guess we can deal with that scenario in userspace by doing the > wait > >> ourselves in one thread per engines. > >> > >> But then it makes the VM_BIND input fences useless. > >> > >> > >> Daniel : what do you think? Should be rework this or just deal with > wait > >> fences in userspace? > >> > > > >My opinion is rework this but make the ordering via an engine param > optional. > > > >e.g. A VM can be configured so all binds are ordered within the VM > > > >e.g. A VM can be configured so all binds accept an engine argument > (in > >the case of the i915 likely this is a gem context handle) and binds > >ordered with respect to that engine. > > > >This gives UMDs options as the later likely consumes more KMD > resources > >so if a different UMD can live with binds being ordered within the VM > >they can use a mode consuming less resources. > > > > I think we need to be careful here if we are looking for some out of > (submission) order completion of vm_bind/unbind. 
> In-order completion means, in a batch of binds and unbinds to be > completed in-order, user only needs to specify in-fence for the > first bind/unbind call and the our-fence for the last bind/unbind > call. Also, the VA released by an unbind call can be re-used by > any subsequent bind call in that in-order batch. > > These things will break if binding/unbinding were to be allowed to > go out of order (of submission) and user need to be extra careful > not to run into pre-mature triggereing of out-fence and bind failing > as VA is still in use etc. > > Also, VM_BIND binds the provided mapping on the specified address > space > (VM). So, the uapi is not engine/context specific. > > We can however add a 'queue' to the uapi which can be one from the > pre-defined queues, > I915_VM_BIND_QUEUE_0 > I915_VM_BIND_QUEUE_1 > ... > I915_VM_BIND_QUEUE_(N-1) > > KMD will spawn an async work queue for each queue which will only > bind the mappings on that queue in the order of submission. > User can assign the queue to per engine or anything like that. > > But again here, user need to be careful and not deadlock these > queues with circular dependency of fences. > > I prefer adding this later an as extension based on whether it > is really helping with the implementation. > > I can tell you right now that having everything on a single in-order > queue will not get us the perf we want. What vulkan really wants is one > of two things: > 1. No implicit ordering of VM_BIND ops. They just happen in whatever > their dependencies are resolved and we ensure ordering ourselves by > having a syncobj in the VkQueue. > 2. The ability to create multiple VM_BIND queues. We need at least 2 > but I don't see why there needs to be a limit besides the limits the > i915 API already has on the number of engines. Vulkan could expose > multiple sparse binding queues to the client if it's not arbitrarily > limited.
Thanks Jason, Lionel.
Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.
There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2. I think someone had an extended version that allowed more but I ripped it out because no one was using it. Of course, execbuffer3 might not have that problem at all.
Thanks Jason. Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE and somehow export it to user (I am thinking of embedding it in I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n queues.
Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3 will also have. So, we can simply define in vm_bind/unbind structures,
#define I915_VM_BIND_MAX_QUEUE 64 __u32 queue;
I think that will keep things simple.
Hmmm? What does the execbuf2 limit have to do with how many engines the hardware can have? I suggest not doing that.
The change which added this to context creation:
if (set.num_engines > I915_EXEC_RING_MASK + 1)
        return -EINVAL;
needs to be undone, so that users can create engine maps with all hardware engines and execbuf3 can access them all.
Regards,
Tvrtko
Niranjana
I am trying to see how many queues we need and don't want it to be arbitrarily large and unduely blow up memory usage and complexity in i915 driver.
I expect a Vulkan driver to use at most 2 in the vast majority of cases. I could imagine a client wanting to create more than 1 sparse queue in which case, it'll be N+1 but that's unlikely. As far as complexity goes, once you allow two, I don't think the complexity is going up by allowing N. As for memory usage, creating more queues means more memory. That's a trade-off that userspace can make. Again, the expected number here is 1 or 2 in the vast majority of cases so I don't think you need to worry.
Ok, will start with n=3 meaning 8 queues. That would require us create 8 workqueues. We can change 'n' later if required.
Niranjana
> Why? Because Vulkan has two basic kind of bind operations and we don't > want any dependencies between them: > 1. Immediate. These happen right after BO creation or maybe as part of > vkBindImageMemory() or VkBindBufferMemory(). These don't happen on a > queue and we don't want them serialized with anything. To synchronize > with submit, we'll have a syncobj in the VkDevice which is signaled by > all immediate bind operations and make submits wait on it. > 2. Queued (sparse): These happen on a VkQueue which may be the same as > a render/compute queue or may be its own queue. It's up to us what we > want to advertise. From the Vulkan API PoV, this is like any other > queue. Operations on it wait on and signal semaphores. If we have a > VM_BIND engine, we'd provide syncobjs to wait and signal just like we do > in execbuf(). > The important thing is that we don't want one type of operation to block > on the other. If immediate binds are blocking on sparse binds, it's > going to cause over-synchronization issues. > In terms of the internal implementation, I know that there's going to be > a lock on the VM and that we can't actually do these things in > parallel. That's fine. Once the dma_fences have signaled and we're
Thats correct. It is like a single VM_BIND engine with multiple queues feeding to it.
Right. As long as the queues themselves are independent and can block on dma_fences without holding up other queues, I think we're fine.
> unblocked to do the bind operation, I don't care if there's a bit of > synchronization due to locking. That's expected. What we can't afford > to have is an immediate bind operation suddenly blocking on a sparse > operation which is blocked on a compute job that's going to run for > another 5ms.
As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND on other VMs. I am not sure about usecases here, but just wanted to clarify.
Yes, that's what I would expect. --Jason
Niranjana
> For reference, Windows solves this by allowing arbitrarily many paging > queues (what they call a VM_BIND engine/queue). That design works > pretty well and solves the problems in question. Again, we could just > make everything out-of-order and require using syncobjs to order things > as userspace wants. That'd be fine too. > One more note while I'm here: danvet said something on IRC about VM_BIND > queues waiting for syncobjs to materialize. We don't really want/need > this. We already have all the machinery in userspace to handle > wait-before-signal and waiting for syncobj fences to materialize and > that machinery is on by default. It would actually take MORE work in > Mesa to turn it off and take advantage of the kernel being able to wait > for syncobjs to materialize. Also, getting that right is ridiculously > hard and I really don't want to get it wrong in kernel space. When we > do memory fences, wait-before-signal will be a thing. We don't need to > try and make it a thing for syncobj. > --Jason > > Thanks Jason, > > I missed the bit in the Vulkan spec that we're allowed to have a sparse > queue that does not implement either graphics or compute operations : > > "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT > support in queue families that also include > > graphics and compute support, other implementations may only expose a > VK_QUEUE_SPARSE_BINDING_BIT-only queue > > family." > > So it can all be all a vm_bind engine that just does bind/unbind > operations. > > But yes we need another engine for the immediate/non-sparse operations. > > -Lionel > > > > Daniel, any thoughts? > > Niranjana > > >Matt > > > >> > >> Sorry I noticed this late. > >> > >> > >> -Lionel > >> > >>
On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote:
On 07/06/2022 22:32, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote:
On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote:
> On 02/06/2022 23:35, Jason Ekstrand wrote:
>> On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura
>> <niranjana.vishwanathapura@intel.com> wrote:
>>> On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew Brost wrote:
>>>> On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
>>>>> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
>>>>>> +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an
>>>>>> +async worker. The binding and unbinding will work like a special GPU engine.
>>>>>> +The binding and unbinding operations are serialized and will wait on specified
>>>>>> +input fences before the operation and will signal the output fences upon the
>>>>>> +completion of the operation. Due to serialization, completion of an operation
>>>>>> +will also indicate that all previous operations are also complete.
>>>>>
>>>>> I guess we should avoid saying "will immediately start binding/unbinding"
>>>>> if there are fences involved.
>>>>>
>>>>> And the fact that it's happening in an async worker seems to imply it's
>>>>> not immediate.
>>>
>>> Ok, will fix. This was added because in the earlier design binding was
>>> deferred until the next execbuff. But now it is non-deferred (immediate
>>> in that sense). But yah, this is confusing and will fix it.
>>>
>>>>> I have a question on the behavior of the bind operation when no input
>>>>> fence is provided. Let's say I do:
>>>>>
>>>>> VM_BIND (out_fence=fence1)
>>>>> VM_BIND (out_fence=fence2)
>>>>> VM_BIND (out_fence=fence3)
>>>>>
>>>>> In what order are the fences going to be signaled? In the order of
>>>>> VM_BIND ioctls? Or out of order? Because you wrote "serialized" I
>>>>> assume it's in order.
>>>
>>> Yes, in the order of VM_BIND/UNBIND ioctls. Note that bind and unbind
>>> will use the same queue and hence are ordered.
>>>
>>>>> One thing I didn't realize is that because we only get one "VM_BIND"
>>>>> engine, there is a disconnect from the Vulkan specification.
>>>>>
>>>>> In Vulkan, VM_BIND operations are serialized but per engine. So you
>>>>> could have something like this:
>>>>>
>>>>> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
>>>>> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
>>>>>
>>>>> fence1 is not signaled
>>>>> fence3 is signaled
>>>>>
>>>>> So the second VM_BIND will proceed before the first VM_BIND.
>>>>>
>>>>> I guess we can deal with that scenario in userspace by doing the wait
>>>>> ourselves in one thread per engine. But then it makes the VM_BIND
>>>>> input fences useless.
>>>>>
>>>>> Daniel: what do you think? Should we rework this or just deal with
>>>>> wait fences in userspace?
>>>>
>>>> My opinion is rework this but make the ordering via an engine param
>>>> optional.
>>>>
>>>> e.g. A VM can be configured so all binds are ordered within the VM.
>>>>
>>>> e.g. A VM can be configured so all binds accept an engine argument (in
>>>> the case of the i915 likely this is a gem context handle) and binds are
>>>> ordered with respect to that engine.
>>>>
>>>> This gives UMDs options, as the latter likely consumes more KMD
>>>> resources, so if a different UMD can live with binds being ordered
>>>> within the VM, they can use a mode consuming less resources.
>>>
>>> I think we need to be careful here if we are looking for some out of
>>> (submission) order completion of vm_bind/unbind.
>>>
>>> In-order completion means, in a batch of binds and unbinds to be
>>> completed in-order, user only needs to specify in-fence for the first
>>> bind/unbind call and the out-fence for the last bind/unbind call. Also,
>>> the VA released by an unbind call can be re-used by any subsequent bind
>>> call in that in-order batch.
>>>
>>> These things will break if binding/unbinding were to be allowed to go
>>> out of order (of submission) and the user needs to be extra careful not
>>> to run into premature triggering of the out-fence, bind failing as the
>>> VA is still in use, etc.
>>>
>>> Also, VM_BIND binds the provided mapping on the specified address space
>>> (VM). So, the uapi is not engine/context specific.
>>>
>>> We can however add a 'queue' to the uapi which can be one from the
>>> pre-defined queues,
>>> I915_VM_BIND_QUEUE_0
>>> I915_VM_BIND_QUEUE_1
>>> ...
>>> I915_VM_BIND_QUEUE_(N-1)
>>>
>>> KMD will spawn an async work queue for each queue which will only bind
>>> the mappings on that queue in the order of submission. User can assign
>>> the queue per engine or anything like that.
>>>
>>> But again here, the user needs to be careful and not deadlock these
>>> queues with a circular dependency of fences.
>>>
>>> I prefer adding this later as an extension based on whether it is
>>> really helping with the implementation.
>>
>> I can tell you right now that having everything on a single in-order
>> queue will not get us the perf we want. What vulkan really wants is one
>> of two things:
>> 1. No implicit ordering of VM_BIND ops. They just happen in whatever
>>    order their dependencies are resolved and we ensure ordering ourselves
>>    by having a syncobj in the VkQueue.
>> 2. The ability to create multiple VM_BIND queues. We need at least 2 but
>>    I don't see why there needs to be a limit besides the limits the i915
>>    API already has on the number of engines. Vulkan could expose multiple
>>    sparse binding queues to the client if it's not arbitrarily limited.
Thanks Jason, Lionel.
Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.
There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2. I think someone had an extended version that allowed more but I ripped it out because no one was using it. Of course, execbuffer3 might not have that problem at all.
Thanks Jason. Ok, I am not sure which exec flag that is, but yah, execbuffer3 probably will not have this limitation. So, we need to define a VM_BIND_MAX_QUEUE and somehow export it to the user (I am thinking of embedding it in I915_PARAM_HAS_VM_BIND: bits[0] -> HAS_VM_BIND, bits[1-3] -> 'n', meaning 2^n queues).
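For illustration, userspace discovery under that bit layout could look roughly like the sketch below. The encoding (bit 0 = feature present, bits 1-3 = log2 of the queue count) is only a proposal from this discussion, not settled uapi, and I915_PARAM_HAS_VM_BIND itself is still part of the RFC.

#include <stdbool.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static bool query_vm_bind(int fd, unsigned int *num_queues)
{
	int value = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_VM_BIND,   /* RFC param, not upstream yet */
		.value = &value,
	};

	if (ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp))
		return false;

	if (!(value & 0x1))
		return false;                      /* VM_BIND not supported */

	/* bits[1-3] = n, advertising 2^n VM_BIND queues (proposed encoding) */
	*num_queues = 1u << ((value >> 1) & 0x7);
	return true;
}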
Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f), which execbuf3 will also have. So, we can simply define in the vm_bind/unbind structures:
#define I915_VM_BIND_MAX_QUEUE 64
        __u32 queue;
I think that will keep things simple.
Hmmm? What does the execbuf2 limit have to do with how many engines the hardware can have? I suggest not doing that.
The change which added this to context creation:

	if (set.num_engines > I915_EXEC_RING_MASK + 1)
		return -EINVAL;

needs to be undone, so that users can create engine maps with all hardware engines and execbuf3 can access them all.
The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuff3 also. Hence, I was using the same limit for VM_BIND queues (64, or 65 if we make it N+1). But, as discussed in another thread of this RFC series, we are planning to drop I915_EXEC_RING_MASK in execbuff3. So, there won't be any uapi that limits the number of engines (and hence the number of vm_bind queues that need to be supported).

If we leave the number of vm_bind queues arbitrarily large (__u32 queue_idx), then we need a hashmap for queue (a wq, work_item and a linked list) lookup from the user-specified queue index. The other option is to just put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer this.
Niranjana
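A rough kernel-side sketch of the "hard limit plus array of queues created upon first use" option mentioned above. The structure and function names here are purely illustrative, not the actual i915 implementation.

#include <linux/err.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/workqueue.h>

#define I915_VM_BIND_MAX_QUEUE 64

struct example_bind_queue {
	struct workqueue_struct *wq;	/* ordered wq, one per bind queue */
};

struct example_vm {
	struct mutex queues_lock;
	struct example_bind_queue *queues[I915_VM_BIND_MAX_QUEUE];
};

static struct example_bind_queue *
example_vm_get_bind_queue(struct example_vm *vm, u32 queue_idx)
{
	struct example_bind_queue *q;

	/* Simple bounds check on the user-supplied queue index. */
	if (queue_idx >= I915_VM_BIND_MAX_QUEUE)
		return ERR_PTR(-EINVAL);

	mutex_lock(&vm->queues_lock);
	q = vm->queues[queue_idx];
	if (!q) {
		/* Lazily create the queue on first use. */
		q = kzalloc(sizeof(*q), GFP_KERNEL);
		if (q) {
			q->wq = alloc_ordered_workqueue("vm_bind_q%u", 0, queue_idx);
			if (!q->wq) {
				kfree(q);
				q = NULL;
			} else {
				vm->queues[queue_idx] = q;
			}
		}
	}
	mutex_unlock(&vm->queues_lock);

	return q ?: ERR_PTR(-ENOMEM);
}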
Regards,
Tvrtko
Niranjana
I am trying to see how many queues we need and don't want it to be arbitrarily large and unduly blow up memory usage and complexity in the i915 driver.
I expect a Vulkan driver to use at most 2 in the vast majority of cases. I could imagine a client wanting to create more than 1 sparse queue in which case, it'll be N+1 but that's unlikely. As far as complexity goes, once you allow two, I don't think the complexity is going up by allowing N. As for memory usage, creating more queues means more memory. That's a trade-off that userspace can make. Again, the expected number here is 1 or 2 in the vast majority of cases so I don't think you need to worry.
Ok, will start with n=3, meaning 8 queues. That would require us to create 8 workqueues. We can change 'n' later if required.
Niranjana
> Why? Because Vulkan has two basic kinds of bind operations and we don't
> want any dependencies between them:
>
> 1. Immediate. These happen right after BO creation or maybe as part of
>    vkBindImageMemory() or VkBindBufferMemory(). These don't happen on a
>    queue and we don't want them serialized with anything. To synchronize
>    with submit, we'll have a syncobj in the VkDevice which is signaled by
>    all immediate bind operations and make submits wait on it.
>
> 2. Queued (sparse): These happen on a VkQueue which may be the same as a
>    render/compute queue or may be its own queue. It's up to us what we
>    want to advertise. From the Vulkan API PoV, this is like any other
>    queue. Operations on it wait on and signal semaphores. If we have a
>    VM_BIND engine, we'd provide syncobjs to wait and signal just like we
>    do in execbuf().
>
> The important thing is that we don't want one type of operation to block
> on the other. If immediate binds are blocking on sparse binds, it's going
> to cause over-synchronization issues.
>
> In terms of the internal implementation, I know that there's going to be
> a lock on the VM and that we can't actually do these things in parallel.
> That's fine. Once the dma_fences have signaled and we're
That's correct. It is like a single VM_BIND engine with multiple queues feeding into it.
Right. As long as the queues themselves are independent and can block on dma_fences without holding up other queues, I think we're fine.
> unblocked to do the bind operation, I don't care if there's a bit of
> synchronization due to locking. That's expected. What we can't afford to
> have is an immediate bind operation suddenly blocking on a sparse
> operation which is blocked on a compute job that's going to run for
> another 5ms.
As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND on other VMs. I am not sure about use cases here, but just wanted to clarify.
Yes, that's what I would expect. --Jason
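To make the immediate-vs-sparse split above concrete, here is a minimal userspace-side sketch. It assumes the per-VM 'queue' index floated earlier in this thread and a hypothetical vm_bind_ioctl() wrapper around the (still RFC) VM_BIND ioctl; none of this is final uapi.

#include <stdint.h>

/* Hypothetical thin wrapper around the RFC vm_bind ioctl; the real struct
 * layout and the 'queue' field are not final. */
int vm_bind_ioctl(int fd, uint32_t vm_id, uint32_t queue, uint32_t handle,
		  uint64_t start, uint64_t length,
		  uint32_t in_syncobj, uint32_t out_syncobj);

enum {
	BIND_QUEUE_IMMEDIATE = 0,	/* vkBindImageMemory()-style binds */
	BIND_QUEUE_SPARSE    = 1,	/* vkQueueBindSparse()-style binds */
};

/* Immediate binds take no in-fence and only signal a device-wide syncobj
 * that later submissions wait on. */
static int bind_immediate(int fd, uint32_t vm_id, uint32_t handle,
			  uint64_t addr, uint64_t size,
			  uint32_t device_bind_syncobj)
{
	return vm_bind_ioctl(fd, vm_id, BIND_QUEUE_IMMEDIATE, handle, addr,
			     size, 0, device_bind_syncobj);
}

/* Sparse binds are fenced against the application's sparse queue, on a
 * separate bind queue so they never hold up the immediate queue above. */
static int bind_sparse(int fd, uint32_t vm_id, uint32_t handle,
		       uint64_t addr, uint64_t size,
		       uint32_t wait_syncobj, uint32_t signal_syncobj)
{
	return vm_bind_ioctl(fd, vm_id, BIND_QUEUE_SPARSE, handle, addr,
			     size, wait_syncobj, signal_syncobj);
}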
Niranjana
> For reference, Windows solves this by allowing arbitrarily many paging
> queues (what they call a VM_BIND engine/queue). That design works pretty
> well and solves the problems in question. Again, we could just make
> everything out-of-order and require using syncobjs to order things as
> userspace wants. That'd be fine too.
>
> One more note while I'm here: danvet said something on IRC about VM_BIND
> queues waiting for syncobjs to materialize. We don't really want/need
> this. We already have all the machinery in userspace to handle
> wait-before-signal and waiting for syncobj fences to materialize, and
> that machinery is on by default. It would actually take MORE work in Mesa
> to turn it off and take advantage of the kernel being able to wait for
> syncobjs to materialize. Also, getting that right is ridiculously hard
> and I really don't want to get it wrong in kernel space. When we do
> memory fences, wait-before-signal will be a thing. We don't need to try
> and make it a thing for syncobj.
>
> --Jason
>
> Thanks Jason,
>
> I missed the bit in the Vulkan spec that we're allowed to have a sparse
> queue that does not implement either graphics or compute operations:
>
> "While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT
> support in queue families that also include graphics and compute support,
> other implementations may only expose a VK_QUEUE_SPARSE_BINDING_BIT-only
> queue family."
>
> So it can all be a vm_bind engine that just does bind/unbind operations.
>
> But yes, we need another engine for the immediate/non-sparse operations.
>
> -Lionel
>
> Daniel, any thoughts?
>
> Niranjana

>Matt
>
>> Sorry I noticed this late.
>>
>> -Lionel
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura < niranjana.vishwanathapura@intel.com> wrote:
> Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f), which
> execbuf3 will also have. So, we can simply define in the vm_bind/unbind
> structures:
>
> #define I915_VM_BIND_MAX_QUEUE 64
>         __u32 queue;
>
> I think that will keep things simple.

Yup! That's exactly the limit I was talking about.

> The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to
> execbuff3 also. Hence, I was using the same limit for VM_BIND queues
> (64, or 65 if we make it N+1). But, as discussed in another thread of
> this RFC series, we are planning to drop I915_EXEC_RING_MASK in
> execbuff3. So, there won't be any uapi that limits the number of engines
> (and hence the number of vm_bind queues that need to be supported).
>
> If we leave the number of vm_bind queues arbitrarily large (__u32
> queue_idx), then we need a hashmap for queue (a wq, work_item and a
> linked list) lookup from the user-specified queue index. The other
> option is to just put some hard limit (say 64 or 65) and use an array of
> queues in the VM (each created upon first use). I prefer this.
I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation.
--Jason
On Wed, Jun 08, 2022 at 04:55:38PM -0500, Jason Ekstrand wrote:
> I don't get why a VM_BIND queue is any different from any other queue or
> userspace-visible kernel object. But I'll leave those details up to
> danvet or whoever else might be reviewing the implementation.
In execbuf3, if the user-specified execbuf3.engine_id is beyond the number of available engines on the gem context, an error is returned to the user. In the VM_BIND case, I am not sure how to do a similar bounds check on the user-specified queue_idx.

In any case, it is an implementation detail and we can use a hashmap for the VM_BIND queues here (there might be a slight ioctl latency added due to the hash lookup, but in the normal case it should be insignificant), which should be OK.
Niranjana
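For completeness, a sketch of that unbounded-index alternative, using an xarray (rather than a literal hashmap) keyed by the user-specified queue index. As before, the names are illustrative only and this is not the actual implementation.

#include <linux/err.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/workqueue.h>
#include <linux/xarray.h>

struct example_bind_queue {
	struct workqueue_struct *wq;	/* ordered wq, one per bind queue */
};

static struct example_bind_queue *
example_lookup_bind_queue(struct xarray *queues, u32 queue_idx)
{
	struct example_bind_queue *q, *old;

	q = xa_load(queues, queue_idx);
	if (q)
		return q;

	/* Lazily create the queue on first use of this index. */
	q = kzalloc(sizeof(*q), GFP_KERNEL);
	if (!q)
		return ERR_PTR(-ENOMEM);
	q->wq = alloc_ordered_workqueue("vm_bind_q%u", 0, queue_idx);
	if (!q->wq) {
		kfree(q);
		return ERR_PTR(-ENOMEM);
	}

	/* xa_cmpxchg() resolves a race with a concurrent creator. */
	old = xa_cmpxchg(queues, queue_idx, NULL, q, GFP_KERNEL);
	if (old) {
		destroy_workqueue(q->wq);
		kfree(q);
		q = xa_is_err(old) ? ERR_PTR(xa_err(old)) : old;
	}
	return q;
}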
On 09/06/2022 00:55, Jason Ekstrand wrote:
> I don't get why a VM_BIND queue is any different from any other queue or
> userspace-visible kernel object. But I'll leave those details up to
> danvet or whoever else might be reviewing the implementation.
>
> --Jason
I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map?

For userspace it's then just a matter of selecting the right queue ID when submitting.
If there is ever a possibility to have this work on the GPU, it would be all ready.
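If bind queues were exposed through the existing engine-map interface as suggested here, userspace could conceivably list them alongside the GPU engines when configuring the context. A very rough sketch, assuming a hypothetical I915_ENGINE_CLASS_VM_BIND-style class value (no such class exists in i915_drm.h today) and a simplified stand-in for the map normally built with I915_DEFINE_CONTEXT_PARAM_ENGINES and set via I915_CONTEXT_PARAM_ENGINES:

#include <drm/i915_drm.h>

/* Hypothetical: there is no VM_BIND engine class in the uapi today. */
#define EXAMPLE_ENGINE_CLASS_VM_BIND 5

/* Simplified stand-in for the engine map uapi struct. */
struct example_engine_map {
	__u64 extensions;
	struct i915_engine_class_instance engines[3];
};

static const struct example_engine_map example_map = {
	.engines = {
		{ .engine_class = I915_ENGINE_CLASS_RENDER,     .engine_instance = 0 }, /* idx 0: execbuf */
		{ .engine_class = I915_ENGINE_CLASS_COMPUTE,    .engine_instance = 0 }, /* idx 1: execbuf */
		{ .engine_class = EXAMPLE_ENGINE_CLASS_VM_BIND, .engine_instance = 0 }, /* idx 2: vm_bind/unbind */
	},
};

Submission would then pick index 0 or 1, while vm_bind/unbind would reference index 2; note this is exactly the coupling to gem_context that the reply below pushes back on.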
Thanks,
-Lionel
Niranjana >Regards, > >Tvrtko > >> >>Niranjana >> >>> >>>> I am trying to see how many queues we need and don't want it to be >>>> arbitrarily >>>> large and unduely blow up memory usage and complexity in i915 driver. >>>> >>>> I expect a Vulkan driver to use at most 2 in the vast majority >>>>of cases. I >>>> could imagine a client wanting to create more than 1 sparse >>>>queue in which >>>> case, it'll be N+1 but that's unlikely. As far as complexity >>>>goes, once >>>> you allow two, I don't think the complexity is going up by >>>>allowing N. As >>>> for memory usage, creating more queues means more memory. That's a >>>> trade-off that userspace can make. Again, the expected number >>>>here is 1 >>>> or 2 in the vast majority of cases so I don't think you need to worry. >>> >>>Ok, will start with n=3 meaning 8 queues. >>>That would require us create 8 workqueues. >>>We can change 'n' later if required. >>> >>>Niranjana >>> >>>> >>>> > Why? Because Vulkan has two basic kind of bind >>>>operations and we >>>> don't >>>> > want any dependencies between them: >>>> > 1. Immediate. These happen right after BO creation or >>>>maybe as >>>> part of >>>> > vkBindImageMemory() or VkBindBufferMemory(). These >>>>don't happen >>>> on a >>>> > queue and we don't want them serialized with anything. To >>>> synchronize >>>> > with submit, we'll have a syncobj in the VkDevice which is >>>> signaled by >>>> > all immediate bind operations and make submits wait on it. >>>> > 2. Queued (sparse): These happen on a VkQueue which may be the >>>> same as >>>> > a render/compute queue or may be its own queue. It's up to us >>>> what we >>>> > want to advertise. From the Vulkan API PoV, this is like any >>>> other >>>> > queue. Operations on it wait on and signal semaphores. If we >>>> have a >>>> > VM_BIND engine, we'd provide syncobjs to wait and >>>>signal just like >>>> we do >>>> > in execbuf(). >>>> > The important thing is that we don't want one type of >>>>operation to >>>> block >>>> > on the other. If immediate binds are blocking on sparse binds, >>>> it's >>>> > going to cause over-synchronization issues. >>>> > In terms of the internal implementation, I know that >>>>there's going >>>> to be >>>> > a lock on the VM and that we can't actually do these things in >>>> > parallel. That's fine. Once the dma_fences have signaled and >>>> we're >>>> >>>> Thats correct. It is like a single VM_BIND engine with >>>>multiple queues >>>> feeding to it. >>>> >>>> Right. As long as the queues themselves are independent and >>>>can block on >>>> dma_fences without holding up other queues, I think we're fine. >>>> >>>> > unblocked to do the bind operation, I don't care if >>>>there's a bit >>>> of >>>> > synchronization due to locking. That's expected. What >>>>we can't >>>> afford >>>> > to have is an immediate bind operation suddenly blocking on a >>>> sparse >>>> > operation which is blocked on a compute job that's going to run >>>> for >>>> > another 5ms. >>>> >>>> As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the >>>> VM_BIND >>>> on other VMs. I am not sure about usecases here, but just wanted to >>>> clarify. >>>> >>>> Yes, that's what I would expect. >>>> --Jason >>>> >>>> Niranjana >>>> >>>> > For reference, Windows solves this by allowing arbitrarily many >>>> paging >>>> > queues (what they call a VM_BIND engine/queue). That >>>>design works >>>> > pretty well and solves the problems in question. 
>>>>Again, we could >>>> just >>>> > make everything out-of-order and require using syncobjs >>>>to order >>>> things >>>> > as userspace wants. That'd be fine too. >>>> > One more note while I'm here: danvet said something on >>>>IRC about >>>> VM_BIND >>>> > queues waiting for syncobjs to materialize. We don't really >>>> want/need >>>> > this. We already have all the machinery in userspace to handle >>>> > wait-before-signal and waiting for syncobj fences to >>>>materialize >>>> and >>>> > that machinery is on by default. It would actually >>>>take MORE work >>>> in >>>> > Mesa to turn it off and take advantage of the kernel >>>>being able to >>>> wait >>>> > for syncobjs to materialize. Also, getting that right is >>>> ridiculously >>>> > hard and I really don't want to get it wrong in kernel >>>>space. When we >>>> > do memory fences, wait-before-signal will be a thing. We don't >>>> need to >>>> > try and make it a thing for syncobj. >>>> > --Jason >>>> > >>>> > Thanks Jason, >>>> > >>>> > I missed the bit in the Vulkan spec that we're allowed to have a >>>> sparse >>>> > queue that does not implement either graphics or compute >>>>operations >>>> : >>>> > >>>> > "While some implementations may include >>>> VK_QUEUE_SPARSE_BINDING_BIT >>>> > support in queue families that also include >>>> > >>>> > graphics and compute support, other implementations may only >>>> expose a >>>> > VK_QUEUE_SPARSE_BINDING_BIT-only queue >>>> > >>>> > family." >>>> > >>>> > So it can all be all a vm_bind engine that just does bind/unbind >>>> > operations. >>>> > >>>> > But yes we need another engine for the immediate/non-sparse >>>> operations. >>>> > >>>> > -Lionel >>>> > >>>> > > >>>> > Daniel, any thoughts? >>>> > >>>> > Niranjana >>>> > >>>> > >Matt >>>> > > >>>> > >> >>>> > >> Sorry I noticed this late. >>>> > >> >>>> > >> >>>> > >> -Lionel >>>> > >> >>>> > >>
On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
On 09/06/2022 00:55, Jason Ekstrand wrote:
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> wrote: On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote: > > >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote: >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote: >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote: >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura >>>> <niranjana.vishwanathapura@intel.com> wrote: >>>> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote: >>>> > >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura >>>> > <niranjana.vishwanathapura@intel.com> wrote: >>>> > >>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew >>>>Brost wrote: >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin >>>> wrote: >>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start >>>> binding/unbinding >>>> > the mapping in an >>>> > >> > +async worker. The binding and unbinding will >>>>work like a >>>> special >>>> > GPU engine. >>>> > >> > +The binding and unbinding operations are serialized and >>>> will >>>> > wait on specified >>>> > >> > +input fences before the operation and will signal the >>>> output >>>> > fences upon the >>>> > >> > +completion of the operation. Due to serialization, >>>> completion of >>>> > an operation >>>> > >> > +will also indicate that all previous operations >>>>are also >>>> > complete. >>>> > >> >>>> > >> I guess we should avoid saying "will immediately start >>>> > binding/unbinding" if >>>> > >> there are fences involved. >>>> > >> >>>> > >> And the fact that it's happening in an async >>>>worker seem to >>>> imply >>>> > it's not >>>> > >> immediate. >>>> > >> >>>> > >>>> > Ok, will fix. >>>> > This was added because in earlier design binding was deferred >>>> until >>>> > next execbuff. >>>> > But now it is non-deferred (immediate in that sense). >>>>But yah, >>>> this is >>>> > confusing >>>> > and will fix it. >>>> > >>>> > >> >>>> > >> I have a question on the behavior of the bind >>>>operation when >>>> no >>>> > input fence >>>> > >> is provided. Let say I do : >>>> > >> >>>> > >> VM_BIND (out_fence=fence1) >>>> > >> >>>> > >> VM_BIND (out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (out_fence=fence3) >>>> > >> >>>> > >> >>>> > >> In what order are the fences going to be signaled? >>>> > >> >>>> > >> In the order of VM_BIND ioctls? Or out of order? >>>> > >> >>>> > >> Because you wrote "serialized I assume it's : in order >>>> > >> >>>> > >>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that >>>>bind and >>>> unbind >>>> > will use >>>> > the same queue and hence are ordered. >>>> > >>>> > >> >>>> > >> One thing I didn't realize is that because we only get one >>>> > "VM_BIND" engine, >>>> > >> there is a disconnect from the Vulkan specification. >>>> > >> >>>> > >> In Vulkan VM_BIND operations are serialized but >>>>per engine. >>>> > >> >>>> > >> So you could have something like this : >>>> > >> >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) >>>> > >> >>>> > >> >>>> > >> fence1 is not signaled >>>> > >> >>>> > >> fence3 is signaled >>>> > >> >>>> > >> So the second VM_BIND will proceed before the >>>>first VM_BIND. 
>>>> > >> >>>> > >> >>>> > >> I guess we can deal with that scenario in >>>>userspace by doing >>>> the >>>> > wait >>>> > >> ourselves in one thread per engines. >>>> > >> >>>> > >> But then it makes the VM_BIND input fences useless. >>>> > >> >>>> > >> >>>> > >> Daniel : what do you think? Should be rework this or just >>>> deal with >>>> > wait >>>> > >> fences in userspace? >>>> > >> >>>> > > >>>> > >My opinion is rework this but make the ordering via >>>>an engine >>>> param >>>> > optional. >>>> > > >>>> > >e.g. A VM can be configured so all binds are ordered >>>>within the >>>> VM >>>> > > >>>> > >e.g. A VM can be configured so all binds accept an engine >>>> argument >>>> > (in >>>> > >the case of the i915 likely this is a gem context >>>>handle) and >>>> binds >>>> > >ordered with respect to that engine. >>>> > > >>>> > >This gives UMDs options as the later likely consumes >>>>more KMD >>>> > resources >>>> > >so if a different UMD can live with binds being >>>>ordered within >>>> the VM >>>> > >they can use a mode consuming less resources. >>>> > > >>>> > >>>> > I think we need to be careful here if we are looking for some >>>> out of >>>> > (submission) order completion of vm_bind/unbind. >>>> > In-order completion means, in a batch of binds and >>>>unbinds to be >>>> > completed in-order, user only needs to specify >>>>in-fence for the >>>> > first bind/unbind call and the our-fence for the last >>>> bind/unbind >>>> > call. Also, the VA released by an unbind call can be >>>>re-used by >>>> > any subsequent bind call in that in-order batch. >>>> > >>>> > These things will break if binding/unbinding were to >>>>be allowed >>>> to >>>> > go out of order (of submission) and user need to be extra >>>> careful >>>> > not to run into pre-mature triggereing of out-fence and bind >>>> failing >>>> > as VA is still in use etc. >>>> > >>>> > Also, VM_BIND binds the provided mapping on the specified >>>> address >>>> > space >>>> > (VM). So, the uapi is not engine/context specific. >>>> > >>>> > We can however add a 'queue' to the uapi which can be >>>>one from >>>> the >>>> > pre-defined queues, >>>> > I915_VM_BIND_QUEUE_0 >>>> > I915_VM_BIND_QUEUE_1 >>>> > ... >>>> > I915_VM_BIND_QUEUE_(N-1) >>>> > >>>> > KMD will spawn an async work queue for each queue which will >>>> only >>>> > bind the mappings on that queue in the order of submission. >>>> > User can assign the queue to per engine or anything >>>>like that. >>>> > >>>> > But again here, user need to be careful and not >>>>deadlock these >>>> > queues with circular dependency of fences. >>>> > >>>> > I prefer adding this later an as extension based on >>>>whether it >>>> > is really helping with the implementation. >>>> > >>>> > I can tell you right now that having everything on a single >>>> in-order >>>> > queue will not get us the perf we want. What vulkan >>>>really wants >>>> is one >>>> > of two things: >>>> > 1. No implicit ordering of VM_BIND ops. They just happen in >>>> whatever >>>> > their dependencies are resolved and we ensure ordering >>>>ourselves >>>> by >>>> > having a syncobj in the VkQueue. >>>> > 2. The ability to create multiple VM_BIND queues. We need at >>>> least 2 >>>> > but I don't see why there needs to be a limit besides >>>>the limits >>>> the >>>> > i915 API already has on the number of engines. Vulkan could >>>> expose >>>> > multiple sparse binding queues to the client if it's not >>>> arbitrarily >>>> > limited. >>>> >>>> Thanks Jason, Lionel. 
>>>> >>>> Jason, what are you referring to when you say "limits the i915 API >>>> already >>>> has on the number of engines"? I am not sure if there is such an uapi >>>> today. >>>> >>>> There's a limit of something like 64 total engines today based on the >>>> number of bits we can cram into the exec flags in execbuffer2. I think >>>> someone had an extended version that allowed more but I ripped it out >>>> because no one was using it. Of course, execbuffer3 might not >>>>have that >>>> problem at all. >>>> >>> >>>Thanks Jason. >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably >>>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE >>>and somehow export it to user (I am thinking of embedding it in >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n >>>queues. >> >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3 Yup! That's exactly the limit I was talking about. >>will also have. So, we can simply define in vm_bind/unbind structures, >> >>#define I915_VM_BIND_MAX_QUEUE 64 >> __u32 queue; >> >>I think that will keep things simple. > >Hmmm? What does execbuf2 limit has to do with how many engines >hardware can have? I suggest not to do that. > >Change with added this: > > if (set.num_engines > I915_EXEC_RING_MASK + 1) > return -EINVAL; > >To context creation needs to be undone and so let users create engine >maps with all hardware engines, and let execbuf3 access them all. > Earlier plan was to carry I915_EXEC_RING_MAP (0x3f) to execbuff3 also. Hence, I was using the same limit for VM_BIND queues (64, or 65 if we make it N+1). But, as discussed in other thread of this RFC series, we are planning to drop this I915_EXEC_RING_MAP in execbuff3. So, there won't be any uapi that limits the number of engines (and hence the vm_bind queues need to be supported). If we leave the number of vm_bind queues to be arbitrarily large (__u32 queue_idx) then, we need to have a hashmap for queue (a wq, work_item and a linked list) lookup from the user specified queue index. Other option is to just put some hard limit (say 64 or 65) and use an array of queues in VM (each created upon first use). I prefer this. I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation. --Jason
I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map?
For userspace it's then just a matter of selecting the right queue ID when submitting.
If there is ever a possibility to have this work on the GPU, it would all be ready.
I did sync offline with Matt Brost on this. We can add a VM_BIND engine class and let the user create VM_BIND engines (queues). The problem is, in i915, the engine creation interface is bound to gem_context. So, in the vm_bind ioctl, we would need both a context_id and a queue_idx for proper lookup of the user-created engine. This is a bit awkward, as vm_bind is an interface to the VM (address space) and has nothing to do with gem_context. Another problem is, if two VMs are binding with the same defined engine, binding on VM1 can get unnecessarily blocked by binding on VM2 (which may be waiting on its in_fence).
So, my preference here is to just add a 'u32 queue' index in the vm_bind/unbind ioctl, with the queues being per VM.
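(For illustration, a minimal sketch of what such a per-VM 'queue' field could look like in the bind payload. All field names and sizes here are assumptions made for this sketch, not the structure from the RFC header.)

    #include <linux/types.h>

    /* Hypothetical layout, for illustration only -- not the RFC's
     * struct drm_i915_gem_vm_bind definition. */
    struct hypothetical_vm_bind {
            __u32 vm_id;        /* address space (VM) to bind into */
            __u32 queue_idx;    /* per-VM bind queue; ops on one queue complete in submission order */
            __u32 handle;       /* GEM object handle */
            __u32 pad;
            __u64 start;        /* GPU virtual address of the mapping */
            __u64 offset;       /* offset into the object (partial binds) */
            __u64 length;       /* length of the mapping */
            __u64 flags;
            __u64 extensions;   /* chained extensions, e.g. in/out fences */
    };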
Niranjana
Thanks,
-Lionel
Niranjana >Regards, > >Tvrtko > >> >>Niranjana >> >>> >>>> I am trying to see how many queues we need and don't want it to be >>>> arbitrarily >>>> large and unduely blow up memory usage and complexity in i915 driver. >>>> >>>> I expect a Vulkan driver to use at most 2 in the vast majority >>>>of cases. I >>>> could imagine a client wanting to create more than 1 sparse >>>>queue in which >>>> case, it'll be N+1 but that's unlikely. As far as complexity >>>>goes, once >>>> you allow two, I don't think the complexity is going up by >>>>allowing N. As >>>> for memory usage, creating more queues means more memory. That's a >>>> trade-off that userspace can make. Again, the expected number >>>>here is 1 >>>> or 2 in the vast majority of cases so I don't think you need to worry. >>> >>>Ok, will start with n=3 meaning 8 queues. >>>That would require us create 8 workqueues. >>>We can change 'n' later if required. >>> >>>Niranjana >>> >>>> >>>> > Why? Because Vulkan has two basic kind of bind >>>>operations and we >>>> don't >>>> > want any dependencies between them: >>>> > 1. Immediate. These happen right after BO creation or >>>>maybe as >>>> part of >>>> > vkBindImageMemory() or VkBindBufferMemory(). These >>>>don't happen >>>> on a >>>> > queue and we don't want them serialized with anything. To >>>> synchronize >>>> > with submit, we'll have a syncobj in the VkDevice which is >>>> signaled by >>>> > all immediate bind operations and make submits wait on it. >>>> > 2. Queued (sparse): These happen on a VkQueue which may be the >>>> same as >>>> > a render/compute queue or may be its own queue. It's up to us >>>> what we >>>> > want to advertise. From the Vulkan API PoV, this is like any >>>> other >>>> > queue. Operations on it wait on and signal semaphores. If we >>>> have a >>>> > VM_BIND engine, we'd provide syncobjs to wait and >>>>signal just like >>>> we do >>>> > in execbuf(). >>>> > The important thing is that we don't want one type of >>>>operation to >>>> block >>>> > on the other. If immediate binds are blocking on sparse binds, >>>> it's >>>> > going to cause over-synchronization issues. >>>> > In terms of the internal implementation, I know that >>>>there's going >>>> to be >>>> > a lock on the VM and that we can't actually do these things in >>>> > parallel. That's fine. Once the dma_fences have signaled and >>>> we're >>>> >>>> Thats correct. It is like a single VM_BIND engine with >>>>multiple queues >>>> feeding to it. >>>> >>>> Right. As long as the queues themselves are independent and >>>>can block on >>>> dma_fences without holding up other queues, I think we're fine. >>>> >>>> > unblocked to do the bind operation, I don't care if >>>>there's a bit >>>> of >>>> > synchronization due to locking. That's expected. What >>>>we can't >>>> afford >>>> > to have is an immediate bind operation suddenly blocking on a >>>> sparse >>>> > operation which is blocked on a compute job that's going to run >>>> for >>>> > another 5ms. >>>> >>>> As the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the >>>> VM_BIND >>>> on other VMs. I am not sure about usecases here, but just wanted to >>>> clarify. >>>> >>>> Yes, that's what I would expect. >>>> --Jason >>>> >>>> Niranjana >>>> >>>> > For reference, Windows solves this by allowing arbitrarily many >>>> paging >>>> > queues (what they call a VM_BIND engine/queue). That >>>>design works >>>> > pretty well and solves the problems in question. 
>>>>Again, we could >>>> just >>>> > make everything out-of-order and require using syncobjs >>>>to order >>>> things >>>> > as userspace wants. That'd be fine too. >>>> > One more note while I'm here: danvet said something on >>>>IRC about >>>> VM_BIND >>>> > queues waiting for syncobjs to materialize. We don't really >>>> want/need >>>> > this. We already have all the machinery in userspace to handle >>>> > wait-before-signal and waiting for syncobj fences to >>>>materialize >>>> and >>>> > that machinery is on by default. It would actually >>>>take MORE work >>>> in >>>> > Mesa to turn it off and take advantage of the kernel >>>>being able to >>>> wait >>>> > for syncobjs to materialize. Also, getting that right is >>>> ridiculously >>>> > hard and I really don't want to get it wrong in kernel >>>>space. When we >>>> > do memory fences, wait-before-signal will be a thing. We don't >>>> need to >>>> > try and make it a thing for syncobj. >>>> > --Jason >>>> > >>>> > Thanks Jason, >>>> > >>>> > I missed the bit in the Vulkan spec that we're allowed to have a >>>> sparse >>>> > queue that does not implement either graphics or compute >>>>operations >>>> : >>>> > >>>> > "While some implementations may include >>>> VK_QUEUE_SPARSE_BINDING_BIT >>>> > support in queue families that also include >>>> > >>>> > graphics and compute support, other implementations may only >>>> expose a >>>> > VK_QUEUE_SPARSE_BINDING_BIT-only queue >>>> > >>>> > family." >>>> > >>>> > So it can all be all a vm_bind engine that just does bind/unbind >>>> > operations. >>>> > >>>> > But yes we need another engine for the immediate/non-sparse >>>> operations. >>>> > >>>> > -Lionel >>>> > >>>> > > >>>> > Daniel, any thoughts? >>>> > >>>> > Niranjana >>>> > >>>> > >Matt >>>> > > >>>> > >> >>>> > >> Sorry I noticed this late. >>>> > >> >>>> > >> >>>> > >> -Lionel >>>> > >> >>>> > >>
On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
On 09/06/2022 00:55, Jason Ekstrand wrote:
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote: > > >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote: >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote: >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote: >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura >>>> niranjana.vishwanathapura@intel.com wrote: >>>> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote: >>>> > >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura >>>> > niranjana.vishwanathapura@intel.com wrote: >>>> > >>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew >>>>Brost wrote: >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin >>>> wrote: >>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start >>>> binding/unbinding >>>> > the mapping in an >>>> > >> > +async worker. The binding and unbinding will >>>>work like a >>>> special >>>> > GPU engine. >>>> > >> > +The binding and unbinding operations are serialized and >>>> will >>>> > wait on specified >>>> > >> > +input fences before the operation and will signal the >>>> output >>>> > fences upon the >>>> > >> > +completion of the operation. Due to serialization, >>>> completion of >>>> > an operation >>>> > >> > +will also indicate that all previous operations >>>>are also >>>> > complete. >>>> > >> >>>> > >> I guess we should avoid saying "will immediately start >>>> > binding/unbinding" if >>>> > >> there are fences involved. >>>> > >> >>>> > >> And the fact that it's happening in an async >>>>worker seem to >>>> imply >>>> > it's not >>>> > >> immediate. >>>> > >> >>>> > >>>> > Ok, will fix. >>>> > This was added because in earlier design binding was deferred >>>> until >>>> > next execbuff. >>>> > But now it is non-deferred (immediate in that sense). >>>>But yah, >>>> this is >>>> > confusing >>>> > and will fix it. >>>> > >>>> > >> >>>> > >> I have a question on the behavior of the bind >>>>operation when >>>> no >>>> > input fence >>>> > >> is provided. Let say I do : >>>> > >> >>>> > >> VM_BIND (out_fence=fence1) >>>> > >> >>>> > >> VM_BIND (out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (out_fence=fence3) >>>> > >> >>>> > >> >>>> > >> In what order are the fences going to be signaled? >>>> > >> >>>> > >> In the order of VM_BIND ioctls? Or out of order? >>>> > >> >>>> > >> Because you wrote "serialized I assume it's : in order >>>> > >> >>>> > >>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that >>>>bind and >>>> unbind >>>> > will use >>>> > the same queue and hence are ordered. >>>> > >>>> > >> >>>> > >> One thing I didn't realize is that because we only get one >>>> > "VM_BIND" engine, >>>> > >> there is a disconnect from the Vulkan specification. >>>> > >> >>>> > >> In Vulkan VM_BIND operations are serialized but >>>>per engine. >>>> > >> >>>> > >> So you could have something like this : >>>> > >> >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) >>>> > >> >>>> > >> >>>> > >> fence1 is not signaled >>>> > >> >>>> > >> fence3 is signaled >>>> > >> >>>> > >> So the second VM_BIND will proceed before the >>>>first VM_BIND. >>>> > >> >>>> > >> >>>> > >> I guess we can deal with that scenario in >>>>userspace by doing >>>> the >>>> > wait >>>> > >> ourselves in one thread per engines. 
>>>> > >> >>>> > >> But then it makes the VM_BIND input fences useless. >>>> > >> >>>> > >> >>>> > >> Daniel : what do you think? Should be rework this or just >>>> deal with >>>> > wait >>>> > >> fences in userspace? >>>> > >> >>>> > > >>>> > >My opinion is rework this but make the ordering via >>>>an engine >>>> param >>>> > optional. >>>> > > >>>> > >e.g. A VM can be configured so all binds are ordered >>>>within the >>>> VM >>>> > > >>>> > >e.g. A VM can be configured so all binds accept an engine >>>> argument >>>> > (in >>>> > >the case of the i915 likely this is a gem context >>>>handle) and >>>> binds >>>> > >ordered with respect to that engine. >>>> > > >>>> > >This gives UMDs options as the later likely consumes >>>>more KMD >>>> > resources >>>> > >so if a different UMD can live with binds being >>>>ordered within >>>> the VM >>>> > >they can use a mode consuming less resources. >>>> > > >>>> > >>>> > I think we need to be careful here if we are looking for some >>>> out of >>>> > (submission) order completion of vm_bind/unbind. >>>> > In-order completion means, in a batch of binds and >>>>unbinds to be >>>> > completed in-order, user only needs to specify >>>>in-fence for the >>>> > first bind/unbind call and the our-fence for the last >>>> bind/unbind >>>> > call. Also, the VA released by an unbind call can be >>>>re-used by >>>> > any subsequent bind call in that in-order batch. >>>> > >>>> > These things will break if binding/unbinding were to >>>>be allowed >>>> to >>>> > go out of order (of submission) and user need to be extra >>>> careful >>>> > not to run into pre-mature triggereing of out-fence and bind >>>> failing >>>> > as VA is still in use etc. >>>> > >>>> > Also, VM_BIND binds the provided mapping on the specified >>>> address >>>> > space >>>> > (VM). So, the uapi is not engine/context specific. >>>> > >>>> > We can however add a 'queue' to the uapi which can be >>>>one from >>>> the >>>> > pre-defined queues, >>>> > I915_VM_BIND_QUEUE_0 >>>> > I915_VM_BIND_QUEUE_1 >>>> > ... >>>> > I915_VM_BIND_QUEUE_(N-1) >>>> > >>>> > KMD will spawn an async work queue for each queue which will >>>> only >>>> > bind the mappings on that queue in the order of submission. >>>> > User can assign the queue to per engine or anything >>>>like that. >>>> > >>>> > But again here, user need to be careful and not >>>>deadlock these >>>> > queues with circular dependency of fences. >>>> > >>>> > I prefer adding this later an as extension based on >>>>whether it >>>> > is really helping with the implementation. >>>> > >>>> > I can tell you right now that having everything on a single >>>> in-order >>>> > queue will not get us the perf we want. What vulkan >>>>really wants >>>> is one >>>> > of two things: >>>> > 1. No implicit ordering of VM_BIND ops. They just happen in >>>> whatever >>>> > their dependencies are resolved and we ensure ordering >>>>ourselves >>>> by >>>> > having a syncobj in the VkQueue. >>>> > 2. The ability to create multiple VM_BIND queues. We need at >>>> least 2 >>>> > but I don't see why there needs to be a limit besides >>>>the limits >>>> the >>>> > i915 API already has on the number of engines. Vulkan could >>>> expose >>>> > multiple sparse binding queues to the client if it's not >>>> arbitrarily >>>> > limited. >>>> >>>> Thanks Jason, Lionel. >>>> >>>> Jason, what are you referring to when you say "limits the i915 API >>>> already >>>> has on the number of engines"? I am not sure if there is such an uapi >>>> today. 
>>>> >>>> There's a limit of something like 64 total engines today based on the >>>> number of bits we can cram into the exec flags in execbuffer2. I think >>>> someone had an extended version that allowed more but I ripped it out >>>> because no one was using it. Of course, execbuffer3 might not >>>>have that >>>> problem at all. >>>> >>> >>>Thanks Jason. >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably >>>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE >>>and somehow export it to user (I am thinking of embedding it in >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n >>>queues. >> >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3
Yup! That's exactly the limit I was talking about.
>>will also have. So, we can simply define in vm_bind/unbind structures, >> >>#define I915_VM_BIND_MAX_QUEUE 64 >> __u32 queue; >> >>I think that will keep things simple. > >Hmmm? What does execbuf2 limit has to do with how many engines >hardware can have? I suggest not to do that. > >Change with added this: > > if (set.num_engines > I915_EXEC_RING_MASK + 1) > return -EINVAL; > >To context creation needs to be undone and so let users create engine >maps with all hardware engines, and let execbuf3 access them all. >
The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuff3 also. Hence, I was using the same limit for VM_BIND queues (64, or 65 if we make it N+1). But, as discussed in another thread of this RFC series, we are planning to drop I915_EXEC_RING_MASK in execbuff3. So, there won't be any uapi that limits the number of engines (and hence the number of vm_bind queues that need to be supported).
If we leave the number of vm_bind queues arbitrarily large (__u32 queue_idx), then we need a hashmap to look up the queue (a wq, work_item and a linked list) from the user-specified queue index. The other option is to just put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer this.
I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation. --Jason
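(For illustration of the 'hard limit plus an array of queues created upon first use' option above: a rough kernel-side sketch, assuming one ordered workqueue per user-visible queue index. The names, limit and locking are assumptions, not the actual i915 implementation.)

    #include <linux/err.h>
    #include <linux/mutex.h>
    #include <linux/types.h>
    #include <linux/workqueue.h>

    #define VM_BIND_MAX_QUEUE 64    /* illustrative hard limit, not final uapi */

    struct hypothetical_vm {
            /* one ordered bind queue per user-visible index, created lazily */
            struct workqueue_struct *bind_wq[VM_BIND_MAX_QUEUE];
            struct mutex lock;
    };

    static struct workqueue_struct *
    vm_get_bind_queue(struct hypothetical_vm *vm, u32 queue_idx)
    {
            struct workqueue_struct *wq;

            if (queue_idx >= VM_BIND_MAX_QUEUE)
                    return ERR_PTR(-EINVAL);

            mutex_lock(&vm->lock);
            wq = vm->bind_wq[queue_idx];
            if (!wq) {
                    /* created upon first use; 'ordered' means binds queued
                     * here complete in submission order */
                    wq = alloc_ordered_workqueue("vm_bind_q%u", 0, queue_idx);
                    if (!wq)
                            wq = ERR_PTR(-ENOMEM);
                    else
                            vm->bind_wq[queue_idx] = wq;
            }
            mutex_unlock(&vm->lock);

            return wq;
    }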
I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map?
For userspace it's then just a matter of selecting the right queue ID when submitting.
If there is ever a possibility to have this work on the GPU, it would all be ready.
I did sync offline with Matt Brost on this. We can add a VM_BIND engine class and let the user create VM_BIND engines (queues). The problem is, in i915, the engine creation interface is bound to gem_context. So, in the vm_bind ioctl, we would need both a context_id and a queue_idx for proper lookup of the user-created engine. This is a bit awkward, as vm_bind is an interface to the VM (address space) and has nothing to do with gem_context.
A gem_context has a single vm object, right?
Set through I915_CONTEXT_PARAM_VM at creation, or given a default one if not.
So it's just like picking up the vm like it's done at execbuffer time right now: eb->context->vm
Another problem is, if two VMs are binding with the same defined engine, binding on VM1 can get unnecessarily blocked by binding on VM2 (which may be waiting on its in_fence).
Maybe I'm missing something, but how can you have 2 vm objects with a single gem_context right now?
So, my preference here is to just add a 'u32 queue' index in the vm_bind/unbind ioctl, with the queues being per VM.
Niranjana
Thanks,
-Lionel
Niranjana
On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
On 09/06/2022 00:55, Jason Ekstrand wrote:
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote: > > >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote: >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote: >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote: >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura >>>> niranjana.vishwanathapura@intel.com wrote: >>>> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote: >>>> > >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura >>>> > niranjana.vishwanathapura@intel.com wrote: >>>> > >>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew >>>>Brost wrote: >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin >>>> wrote: >>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start >>>> binding/unbinding >>>> > the mapping in an >>>> > >> > +async worker. The binding and unbinding will >>>>work like a >>>> special >>>> > GPU engine. >>>> > >> > +The binding and unbinding operations are serialized and >>>> will >>>> > wait on specified >>>> > >> > +input fences before the operation and will signal the >>>> output >>>> > fences upon the >>>> > >> > +completion of the operation. Due to serialization, >>>> completion of >>>> > an operation >>>> > >> > +will also indicate that all previous operations >>>>are also >>>> > complete. >>>> > >> >>>> > >> I guess we should avoid saying "will immediately start >>>> > binding/unbinding" if >>>> > >> there are fences involved. >>>> > >> >>>> > >> And the fact that it's happening in an async >>>>worker seem to >>>> imply >>>> > it's not >>>> > >> immediate. >>>> > >> >>>> > >>>> > Ok, will fix. >>>> > This was added because in earlier design binding was deferred >>>> until >>>> > next execbuff. >>>> > But now it is non-deferred (immediate in that sense). >>>>But yah, >>>> this is >>>> > confusing >>>> > and will fix it. >>>> > >>>> > >> >>>> > >> I have a question on the behavior of the bind >>>>operation when >>>> no >>>> > input fence >>>> > >> is provided. Let say I do : >>>> > >> >>>> > >> VM_BIND (out_fence=fence1) >>>> > >> >>>> > >> VM_BIND (out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (out_fence=fence3) >>>> > >> >>>> > >> >>>> > >> In what order are the fences going to be signaled? >>>> > >> >>>> > >> In the order of VM_BIND ioctls? Or out of order? >>>> > >> >>>> > >> Because you wrote "serialized I assume it's : in order >>>> > >> >>>> > >>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that >>>>bind and >>>> unbind >>>> > will use >>>> > the same queue and hence are ordered. >>>> > >>>> > >> >>>> > >> One thing I didn't realize is that because we only get one >>>> > "VM_BIND" engine, >>>> > >> there is a disconnect from the Vulkan specification. >>>> > >> >>>> > >> In Vulkan VM_BIND operations are serialized but >>>>per engine. >>>> > >> >>>> > >> So you could have something like this : >>>> > >> >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) >>>> > >> >>>> > >> >>>> > >> fence1 is not signaled >>>> > >> >>>> > >> fence3 is signaled >>>> > >> >>>> > >> So the second VM_BIND will proceed before the >>>>first VM_BIND. >>>> > >> >>>> > >> >>>> > >> I guess we can deal with that scenario in >>>>userspace by doing >>>> the >>>> > wait >>>> > >> ourselves in one thread per engines. 
>>>> > >> >>>> > >> But then it makes the VM_BIND input fences useless. >>>> > >> >>>> > >> >>>> > >> Daniel : what do you think? Should be rework this or just >>>> deal with >>>> > wait >>>> > >> fences in userspace? >>>> > >> >>>> > > >>>> > >My opinion is rework this but make the ordering via >>>>an engine >>>> param >>>> > optional. >>>> > > >>>> > >e.g. A VM can be configured so all binds are ordered >>>>within the >>>> VM >>>> > > >>>> > >e.g. A VM can be configured so all binds accept an engine >>>> argument >>>> > (in >>>> > >the case of the i915 likely this is a gem context >>>>handle) and >>>> binds >>>> > >ordered with respect to that engine. >>>> > > >>>> > >This gives UMDs options as the later likely consumes >>>>more KMD >>>> > resources >>>> > >so if a different UMD can live with binds being >>>>ordered within >>>> the VM >>>> > >they can use a mode consuming less resources. >>>> > > >>>> > >>>> > I think we need to be careful here if we are looking for some >>>> out of >>>> > (submission) order completion of vm_bind/unbind. >>>> > In-order completion means, in a batch of binds and >>>>unbinds to be >>>> > completed in-order, user only needs to specify >>>>in-fence for the >>>> > first bind/unbind call and the our-fence for the last >>>> bind/unbind >>>> > call. Also, the VA released by an unbind call can be >>>>re-used by >>>> > any subsequent bind call in that in-order batch. >>>> > >>>> > These things will break if binding/unbinding were to >>>>be allowed >>>> to >>>> > go out of order (of submission) and user need to be extra >>>> careful >>>> > not to run into pre-mature triggereing of out-fence and bind >>>> failing >>>> > as VA is still in use etc. >>>> > >>>> > Also, VM_BIND binds the provided mapping on the specified >>>> address >>>> > space >>>> > (VM). So, the uapi is not engine/context specific. >>>> > >>>> > We can however add a 'queue' to the uapi which can be >>>>one from >>>> the >>>> > pre-defined queues, >>>> > I915_VM_BIND_QUEUE_0 >>>> > I915_VM_BIND_QUEUE_1 >>>> > ... >>>> > I915_VM_BIND_QUEUE_(N-1) >>>> > >>>> > KMD will spawn an async work queue for each queue which will >>>> only >>>> > bind the mappings on that queue in the order of submission. >>>> > User can assign the queue to per engine or anything >>>>like that. >>>> > >>>> > But again here, user need to be careful and not >>>>deadlock these >>>> > queues with circular dependency of fences. >>>> > >>>> > I prefer adding this later an as extension based on >>>>whether it >>>> > is really helping with the implementation. >>>> > >>>> > I can tell you right now that having everything on a single >>>> in-order >>>> > queue will not get us the perf we want. What vulkan >>>>really wants >>>> is one >>>> > of two things: >>>> > 1. No implicit ordering of VM_BIND ops. They just happen in >>>> whatever >>>> > their dependencies are resolved and we ensure ordering >>>>ourselves >>>> by >>>> > having a syncobj in the VkQueue. >>>> > 2. The ability to create multiple VM_BIND queues. We need at >>>> least 2 >>>> > but I don't see why there needs to be a limit besides >>>>the limits >>>> the >>>> > i915 API already has on the number of engines. Vulkan could >>>> expose >>>> > multiple sparse binding queues to the client if it's not >>>> arbitrarily >>>> > limited. >>>> >>>> Thanks Jason, Lionel. >>>> >>>> Jason, what are you referring to when you say "limits the i915 API >>>> already >>>> has on the number of engines"? I am not sure if there is such an uapi >>>> today. 
>>>> >>>> There's a limit of something like 64 total engines today based on the >>>> number of bits we can cram into the exec flags in execbuffer2. I think >>>> someone had an extended version that allowed more but I ripped it out >>>> because no one was using it. Of course, execbuffer3 might not >>>>have that >>>> problem at all. >>>> >>> >>>Thanks Jason. >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably >>>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE >>>and somehow export it to user (I am thinking of embedding it in >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n >>>queues. >> >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3
Yup! That's exactly the limit I was talking about.
>>will also have. So, we can simply define in vm_bind/unbind structures, >> >>#define I915_VM_BIND_MAX_QUEUE 64 >> __u32 queue; >> >>I think that will keep things simple. > >Hmmm? What does execbuf2 limit has to do with how many engines >hardware can have? I suggest not to do that. > >Change with added this: > > if (set.num_engines > I915_EXEC_RING_MASK + 1) > return -EINVAL; > >To context creation needs to be undone and so let users create engine >maps with all hardware engines, and let execbuf3 access them all. >
The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuff3 also. Hence, I was using the same limit for VM_BIND queues (64, or 65 if we make it N+1). But, as discussed in another thread of this RFC series, we are planning to drop I915_EXEC_RING_MASK in execbuff3. So, there won't be any uapi that limits the number of engines (and hence the number of vm_bind queues that need to be supported).
If we leave the number of vm_bind queues arbitrarily large (__u32 queue_idx), then we need a hashmap to look up the queue (a wq, work_item and a linked list) from the user-specified queue index. The other option is to just put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer this.
I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation. --Jason
I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map?
For userspace it's then just a matter of selecting the right queue ID when submitting.
If there is ever a possibility to have this work on the GPU, it would all be ready.
I did sync offline with Matt Brost on this. We can add a VM_BIND engine class and let the user create VM_BIND engines (queues). The problem is, in i915, the engine creation interface is bound to gem_context. So, in the vm_bind ioctl, we would need both a context_id and a queue_idx for proper lookup of the user-created engine. This is a bit awkward, as vm_bind is an interface to the VM (address space) and has nothing to do with gem_context.
A gem_context has a single vm object, right?
Set through I915_CONTEXT_PARAM_VM at creation, or given a default one if not.
So it's just like picking up the vm like it's done at execbuffer time right now: eb->context->vm
Are you suggesting replacing 'vm_id' with 'context_id' in the VM_BIND/UNBIND ioctl, and probably calling it CONTEXT_BIND/UNBIND, because the VM can be obtained from the context? I think the interface is clean as an interface to the VM. It is only that we don't have a clean way to create a raw VM_BIND engine (not associated with any context) with the i915 uapi. Maybe we can add such an interface, but I don't think it is worth it (we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as I mentioned above). Does anyone have any thoughts?
Another problem is, if two VMs are binding with the same defined engine, binding on VM1 can get unnecessarily blocked by binding on VM2 (which may be waiting on its in_fence).
Maybe I'm missing something, but how can you have 2 vm objects with a single gem_context right now?
No, we don't have 2 VMs for a gem_context. Say ctx1 has vm1 and ctx2 has vm2. The first vm_bind call was for vm1 with q_idx 1 in the ctx1 engine map. The second vm_bind call was for vm2 with q_idx 2 in the ctx2 engine map. If those two queue indices point to the same underlying vm_bind engine, then the second vm_bind call gets blocked until the first vm_bind call's 'in' fence is triggered and the bind completes.
With per-VM queues, this is not a problem as two VMs will not end up sharing the same queue.
BTW, I just posted an updated patch series: https://www.spinics.net/lists/dri-devel/msg350483.html
Niranjana
So, my preference here is to just add a 'u32 queue' index in the vm_bind/unbind ioctl, with the queues being per VM.
Niranjana
Thanks,
-Lionel
Niranjana
On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
On 09/06/2022 00:55, Jason Ekstrand wrote:
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote: > > >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote: >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote: >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote: >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura >>>> niranjana.vishwanathapura@intel.com wrote: >>>> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote: >>>> > >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura >>>> > niranjana.vishwanathapura@intel.com wrote: >>>> > >>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew >>>>Brost wrote: >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin >>>> wrote: >>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start >>>> binding/unbinding >>>> > the mapping in an >>>> > >> > +async worker. The binding and unbinding will >>>>work like a >>>> special >>>> > GPU engine. >>>> > >> > +The binding and unbinding operations are serialized and >>>> will >>>> > wait on specified >>>> > >> > +input fences before the operation and will signal the >>>> output >>>> > fences upon the >>>> > >> > +completion of the operation. Due to serialization, >>>> completion of >>>> > an operation >>>> > >> > +will also indicate that all previous operations >>>>are also >>>> > complete. >>>> > >> >>>> > >> I guess we should avoid saying "will immediately start >>>> > binding/unbinding" if >>>> > >> there are fences involved. >>>> > >> >>>> > >> And the fact that it's happening in an async >>>>worker seem to >>>> imply >>>> > it's not >>>> > >> immediate. >>>> > >> >>>> > >>>> > Ok, will fix. >>>> > This was added because in earlier design binding was deferred >>>> until >>>> > next execbuff. >>>> > But now it is non-deferred (immediate in that sense). >>>>But yah, >>>> this is >>>> > confusing >>>> > and will fix it. >>>> > >>>> > >> >>>> > >> I have a question on the behavior of the bind >>>>operation when >>>> no >>>> > input fence >>>> > >> is provided. Let say I do : >>>> > >> >>>> > >> VM_BIND (out_fence=fence1) >>>> > >> >>>> > >> VM_BIND (out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (out_fence=fence3) >>>> > >> >>>> > >> >>>> > >> In what order are the fences going to be signaled? >>>> > >> >>>> > >> In the order of VM_BIND ioctls? Or out of order? >>>> > >> >>>> > >> Because you wrote "serialized I assume it's : in order >>>> > >> >>>> > >>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that >>>>bind and >>>> unbind >>>> > will use >>>> > the same queue and hence are ordered. >>>> > >>>> > >> >>>> > >> One thing I didn't realize is that because we only get one >>>> > "VM_BIND" engine, >>>> > >> there is a disconnect from the Vulkan specification. >>>> > >> >>>> > >> In Vulkan VM_BIND operations are serialized but >>>>per engine. >>>> > >> >>>> > >> So you could have something like this : >>>> > >> >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) >>>> > >> >>>> > >> >>>> > >> fence1 is not signaled >>>> > >> >>>> > >> fence3 is signaled >>>> > >> >>>> > >> So the second VM_BIND will proceed before the >>>>first VM_BIND. >>>> > >> >>>> > >> >>>> > >> I guess we can deal with that scenario in >>>>userspace by doing >>>> the >>>> > wait >>>> > >> ourselves in one thread per engines. 
>>>> > >> >>>> > >> But then it makes the VM_BIND input fences useless. >>>> > >> >>>> > >> >>>> > >> Daniel : what do you think? Should be rework this or just >>>> deal with >>>> > wait >>>> > >> fences in userspace? >>>> > >> >>>> > > >>>> > >My opinion is rework this but make the ordering via >>>>an engine >>>> param >>>> > optional. >>>> > > >>>> > >e.g. A VM can be configured so all binds are ordered >>>>within the >>>> VM >>>> > > >>>> > >e.g. A VM can be configured so all binds accept an engine >>>> argument >>>> > (in >>>> > >the case of the i915 likely this is a gem context >>>>handle) and >>>> binds >>>> > >ordered with respect to that engine. >>>> > > >>>> > >This gives UMDs options as the later likely consumes >>>>more KMD >>>> > resources >>>> > >so if a different UMD can live with binds being >>>>ordered within >>>> the VM >>>> > >they can use a mode consuming less resources. >>>> > > >>>> > >>>> > I think we need to be careful here if we are looking for some >>>> out of >>>> > (submission) order completion of vm_bind/unbind. >>>> > In-order completion means, in a batch of binds and >>>>unbinds to be >>>> > completed in-order, user only needs to specify >>>>in-fence for the >>>> > first bind/unbind call and the our-fence for the last >>>> bind/unbind >>>> > call. Also, the VA released by an unbind call can be >>>>re-used by >>>> > any subsequent bind call in that in-order batch. >>>> > >>>> > These things will break if binding/unbinding were to >>>>be allowed >>>> to >>>> > go out of order (of submission) and user need to be extra >>>> careful >>>> > not to run into pre-mature triggereing of out-fence and bind >>>> failing >>>> > as VA is still in use etc. >>>> > >>>> > Also, VM_BIND binds the provided mapping on the specified >>>> address >>>> > space >>>> > (VM). So, the uapi is not engine/context specific. >>>> > >>>> > We can however add a 'queue' to the uapi which can be >>>>one from >>>> the >>>> > pre-defined queues, >>>> > I915_VM_BIND_QUEUE_0 >>>> > I915_VM_BIND_QUEUE_1 >>>> > ... >>>> > I915_VM_BIND_QUEUE_(N-1) >>>> > >>>> > KMD will spawn an async work queue for each queue which will >>>> only >>>> > bind the mappings on that queue in the order of submission. >>>> > User can assign the queue to per engine or anything >>>>like that. >>>> > >>>> > But again here, user need to be careful and not >>>>deadlock these >>>> > queues with circular dependency of fences. >>>> > >>>> > I prefer adding this later an as extension based on >>>>whether it >>>> > is really helping with the implementation. >>>> > >>>> > I can tell you right now that having everything on a single >>>> in-order >>>> > queue will not get us the perf we want. What vulkan >>>>really wants >>>> is one >>>> > of two things: >>>> > 1. No implicit ordering of VM_BIND ops. They just happen in >>>> whatever >>>> > their dependencies are resolved and we ensure ordering >>>>ourselves >>>> by >>>> > having a syncobj in the VkQueue. >>>> > 2. The ability to create multiple VM_BIND queues. We need at >>>> least 2 >>>> > but I don't see why there needs to be a limit besides >>>>the limits >>>> the >>>> > i915 API already has on the number of engines. Vulkan could >>>> expose >>>> > multiple sparse binding queues to the client if it's not >>>> arbitrarily >>>> > limited. >>>> >>>> Thanks Jason, Lionel. >>>> >>>> Jason, what are you referring to when you say "limits the i915 API >>>> already >>>> has on the number of engines"? I am not sure if there is such an uapi >>>> today. 
>>>> >>>> There's a limit of something like 64 total engines today based on the >>>> number of bits we can cram into the exec flags in execbuffer2. I think >>>> someone had an extended version that allowed more but I ripped it out >>>> because no one was using it. Of course, execbuffer3 might not >>>>have that >>>> problem at all. >>>> >>> >>>Thanks Jason. >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably >>>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE >>>and somehow export it to user (I am thinking of embedding it in >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n >>>queues. >> >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3
Yup! That's exactly the limit I was talking about.
>>will also have. So, we can simply define in vm_bind/unbind structures, >> >>#define I915_VM_BIND_MAX_QUEUE 64 >> __u32 queue; >> >>I think that will keep things simple. > >Hmmm? What does execbuf2 limit has to do with how many engines >hardware can have? I suggest not to do that. > >Change with added this: > > if (set.num_engines > I915_EXEC_RING_MASK + 1) > return -EINVAL; > >To context creation needs to be undone and so let users create engine >maps with all hardware engines, and let execbuf3 access them all. >
The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuff3 also. Hence, I was using the same limit for VM_BIND queues (64, or 65 if we make it N+1). But, as discussed in another thread of this RFC series, we are planning to drop I915_EXEC_RING_MASK in execbuff3. So, there won't be any uapi that limits the number of engines (and hence the number of vm_bind queues that need to be supported).
If we leave the number of vm_bind queues arbitrarily large (__u32 queue_idx), then we need a hashmap to look up the queue (a wq, work_item and a linked list) from the user-specified queue index. The other option is to just put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer this.
I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation. --Jason
I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map?
For userspace it's then just a matter of selecting the right queue ID when submitting.
If there is ever a possibility to have this work on the GPU, it would all be ready.
I did sync offline with Matt Brost on this. We can add a VM_BIND engine class and let the user create VM_BIND engines (queues). The problem is, in i915, the engine creation interface is bound to gem_context. So, in the vm_bind ioctl, we would need both a context_id and a queue_idx for proper lookup of the user-created engine. This is a bit awkward, as vm_bind is an interface to the VM (address space) and has nothing to do with gem_context.
A gem_context has a single vm object, right?
Set through I915_CONTEXT_PARAM_VM at creation, or given a default one if not.
So it's just like picking up the vm like it's done at execbuffer time right now: eb->context->vm
Are you suggesting replacing 'vm_id' with 'context_id' in the VM_BIND/UNBIND ioctl, and probably calling it CONTEXT_BIND/UNBIND, because the VM can be obtained from the context?
Yes, because if we go for engines, they're associated with a context and so also associated with the VM bound to the context.
I think the interface is clean as an interface to the VM. It is only that we don't have a clean way to create a raw VM_BIND engine (not associated with any context) with the i915 uapi. Maybe we can add such an interface, but I don't think it is worth it (we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as I mentioned above). Does anyone have any thoughts?
Another problem is, if two VMs are binding with the same defined engine, binding on VM1 can get unnecessarily blocked by binding on VM2 (which may be waiting on its in_fence).
Maybe I'm missing something, but how can you have 2 vm objects with a single gem_context right now?
No, we don't have 2 VMs for a gem_context. Say ctx1 has vm1 and ctx2 has vm2. The first vm_bind call was for vm1 with q_idx 1 in the ctx1 engine map. The second vm_bind call was for vm2 with q_idx 2 in the ctx2 engine map. If those two queue indices point to the same underlying vm_bind engine, then the second vm_bind call gets blocked until the first vm_bind call's 'in' fence is triggered and the bind completes.
With per-VM queues, this is not a problem as two VMs will not end up sharing the same queue.
BTW, I just posted an updated patch series: https://www.spinics.net/lists/dri-devel/msg350483.html
Niranjana
So, my preference here is to just add a 'u32 queue' index in the vm_bind/unbind ioctl, with the queues being per VM.
Niranjana
Thanks,
-Lionel
Niranjana
On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
On 09/06/2022 22:31, Niranjana Vishwanathapura wrote:
On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote:
On 09/06/2022 00:55, Jason Ekstrand wrote:
On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko Ursulin wrote: > > >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote: >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana Vishwanathapura wrote: >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason Ekstrand wrote: >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana Vishwanathapura >>>> niranjana.vishwanathapura@intel.com wrote: >>>> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel Landwerlin wrote: >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote: >>>> > >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana Vishwanathapura >>>> > niranjana.vishwanathapura@intel.com wrote: >>>> > >>>> > On Wed, Jun 01, 2022 at 01:28:36PM -0700, Matthew >>>>Brost wrote: >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin >>>> wrote: >>>> > >> On 17/05/2022 21:32, Niranjana Vishwanathapura wrote: >>>> > >> > +VM_BIND/UNBIND ioctl will immediately start >>>> binding/unbinding >>>> > the mapping in an >>>> > >> > +async worker. The binding and unbinding will >>>>work like a >>>> special >>>> > GPU engine. >>>> > >> > +The binding and unbinding operations are serialized and >>>> will >>>> > wait on specified >>>> > >> > +input fences before the operation and will signal the >>>> output >>>> > fences upon the >>>> > >> > +completion of the operation. Due to serialization, >>>> completion of >>>> > an operation >>>> > >> > +will also indicate that all previous operations >>>>are also >>>> > complete. >>>> > >> >>>> > >> I guess we should avoid saying "will immediately start >>>> > binding/unbinding" if >>>> > >> there are fences involved. >>>> > >> >>>> > >> And the fact that it's happening in an async >>>>worker seem to >>>> imply >>>> > it's not >>>> > >> immediate. >>>> > >> >>>> > >>>> > Ok, will fix. >>>> > This was added because in earlier design binding was deferred >>>> until >>>> > next execbuff. >>>> > But now it is non-deferred (immediate in that sense). >>>>But yah, >>>> this is >>>> > confusing >>>> > and will fix it. >>>> > >>>> > >> >>>> > >> I have a question on the behavior of the bind >>>>operation when >>>> no >>>> > input fence >>>> > >> is provided. Let say I do : >>>> > >> >>>> > >> VM_BIND (out_fence=fence1) >>>> > >> >>>> > >> VM_BIND (out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (out_fence=fence3) >>>> > >> >>>> > >> >>>> > >> In what order are the fences going to be signaled? >>>> > >> >>>> > >> In the order of VM_BIND ioctls? Or out of order? >>>> > >> >>>> > >> Because you wrote "serialized I assume it's : in order >>>> > >> >>>> > >>>> > Yes, in the order of VM_BIND/UNBIND ioctls. Note that >>>>bind and >>>> unbind >>>> > will use >>>> > the same queue and hence are ordered. >>>> > >>>> > >> >>>> > >> One thing I didn't realize is that because we only get one >>>> > "VM_BIND" engine, >>>> > >> there is a disconnect from the Vulkan specification. >>>> > >> >>>> > >> In Vulkan VM_BIND operations are serialized but >>>>per engine. >>>> > >> >>>> > >> So you could have something like this : >>>> > >> >>>> > >> VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2) >>>> > >> >>>> > >> VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4) >>>> > >> >>>> > >> >>>> > >> fence1 is not signaled >>>> > >> >>>> > >> fence3 is signaled >>>> > >> >>>> > >> So the second VM_BIND will proceed before the >>>>first VM_BIND. >>>> > >> >>>> > >> >>>> > >> I guess we can deal with that scenario in >>>>userspace by doing >>>> the >>>> > wait >>>> > >> ourselves in one thread per engines. 
>>>> > >> >>>> > >> But then it makes the VM_BIND input fences useless. >>>> > >> >>>> > >> >>>> > >> Daniel : what do you think? Should be rework this or just >>>> deal with >>>> > wait >>>> > >> fences in userspace? >>>> > >> >>>> > > >>>> > >My opinion is rework this but make the ordering via >>>>an engine >>>> param >>>> > optional. >>>> > > >>>> > >e.g. A VM can be configured so all binds are ordered >>>>within the >>>> VM >>>> > > >>>> > >e.g. A VM can be configured so all binds accept an engine >>>> argument >>>> > (in >>>> > >the case of the i915 likely this is a gem context >>>>handle) and >>>> binds >>>> > >ordered with respect to that engine. >>>> > > >>>> > >This gives UMDs options as the later likely consumes >>>>more KMD >>>> > resources >>>> > >so if a different UMD can live with binds being >>>>ordered within >>>> the VM >>>> > >they can use a mode consuming less resources. >>>> > > >>>> > >>>> > I think we need to be careful here if we are looking for some >>>> out of >>>> > (submission) order completion of vm_bind/unbind. >>>> > In-order completion means, in a batch of binds and >>>>unbinds to be >>>> > completed in-order, user only needs to specify >>>>in-fence for the >>>> > first bind/unbind call and the our-fence for the last >>>> bind/unbind >>>> > call. Also, the VA released by an unbind call can be >>>>re-used by >>>> > any subsequent bind call in that in-order batch. >>>> > >>>> > These things will break if binding/unbinding were to >>>>be allowed >>>> to >>>> > go out of order (of submission) and user need to be extra >>>> careful >>>> > not to run into pre-mature triggereing of out-fence and bind >>>> failing >>>> > as VA is still in use etc. >>>> > >>>> > Also, VM_BIND binds the provided mapping on the specified >>>> address >>>> > space >>>> > (VM). So, the uapi is not engine/context specific. >>>> > >>>> > We can however add a 'queue' to the uapi which can be >>>>one from >>>> the >>>> > pre-defined queues, >>>> > I915_VM_BIND_QUEUE_0 >>>> > I915_VM_BIND_QUEUE_1 >>>> > ... >>>> > I915_VM_BIND_QUEUE_(N-1) >>>> > >>>> > KMD will spawn an async work queue for each queue which will >>>> only >>>> > bind the mappings on that queue in the order of submission. >>>> > User can assign the queue to per engine or anything >>>>like that. >>>> > >>>> > But again here, user need to be careful and not >>>>deadlock these >>>> > queues with circular dependency of fences. >>>> > >>>> > I prefer adding this later an as extension based on >>>>whether it >>>> > is really helping with the implementation. >>>> > >>>> > I can tell you right now that having everything on a single >>>> in-order >>>> > queue will not get us the perf we want. What vulkan >>>>really wants >>>> is one >>>> > of two things: >>>> > 1. No implicit ordering of VM_BIND ops. They just happen in >>>> whatever >>>> > their dependencies are resolved and we ensure ordering >>>>ourselves >>>> by >>>> > having a syncobj in the VkQueue. >>>> > 2. The ability to create multiple VM_BIND queues. We need at >>>> least 2 >>>> > but I don't see why there needs to be a limit besides >>>>the limits >>>> the >>>> > i915 API already has on the number of engines. Vulkan could >>>> expose >>>> > multiple sparse binding queues to the client if it's not >>>> arbitrarily >>>> > limited. >>>> >>>> Thanks Jason, Lionel. >>>> >>>> Jason, what are you referring to when you say "limits the i915 API >>>> already >>>> has on the number of engines"? I am not sure if there is such an uapi >>>> today. 
>>>> >>>> There's a limit of something like 64 total engines today based on the >>>> number of bits we can cram into the exec flags in execbuffer2. I think >>>> someone had an extended version that allowed more but I ripped it out >>>> because no one was using it. Of course, execbuffer3 might not >>>>have that >>>> problem at all. >>>> >>> >>>Thanks Jason. >>>Ok, I am not sure which exec flag is that, but yah, execbuffer3 probably >>>will not have this limiation. So, we need to define a VM_BIND_MAX_QUEUE >>>and somehow export it to user (I am thinking of embedding it in >>>I915_PARAM_HAS_VM_BIND. bits[0]->HAS_VM_BIND, bits[1-3]->'n' meaning 2^n >>>queues. >> >>Ah, I think you are waking about I915_EXEC_RING_MASK (0x3f) which execbuf3
Yup! That's exactly the limit I was talking about.
>>will also have. So, we can simply define in vm_bind/unbind structures, >> >>#define I915_VM_BIND_MAX_QUEUE 64 >> __u32 queue; >> >>I think that will keep things simple. > >Hmmm? What does execbuf2 limit has to do with how many engines >hardware can have? I suggest not to do that. > >Change with added this: > > if (set.num_engines > I915_EXEC_RING_MASK + 1) > return -EINVAL; > >To context creation needs to be undone and so let users create engine >maps with all hardware engines, and let execbuf3 access them all. >
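To make the encoding proposed above concrete (bit 0 advertising VM_BIND, bits[1-3] carrying 'n' for 2^n queues), userspace could probe it roughly as below; this is only a sketch of that proposal, not an existing parameter layout:

    int value = 0;
    struct drm_i915_getparam gp = {
            .param = I915_PARAM_HAS_VM_BIND,   /* proposed param, not yet in i915_drm.h */
            .value = &value,
    };

    if (drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) == 0 && (value & 0x1)) {
            unsigned int n = (value >> 1) & 0x7;      /* bits[1-3] -> 'n' */
            unsigned int num_bind_queues = 1u << n;   /* 2^n bind queues */
            /* size userspace-side queue tracking from num_bind_queues */
    }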
The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuff3 as well. Hence, I was using the same limit for the VM_BIND queues (64, or 65 if we make it N+1). But, as discussed in another thread of this RFC series, we are planning to drop I915_EXEC_RING_MASK from execbuff3. So, there won't be any uapi that limits the number of engines (and hence the number of vm_bind queues that need to be supported).

If we leave the number of vm_bind queues arbitrarily large (__u32 queue_idx), then we need a hashmap to look up the queue (a wq, work_item and a linked list) from the user-specified queue index. The other option is to put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer the latter.
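A minimal sketch of that preferred option, assuming a hard cap and a per-VM array of queues created on first use (structure and helper names here are illustrative, not actual i915 code):

    #define I915_VM_BIND_MAX_QUEUE 64    /* hard cap discussed above */

    /* hypothetical per-VM bind queue; binds on one queue complete in order */
    struct vm_bind_queue {
            struct workqueue_struct *wq;
    };

    /* in the VM: all entries NULL at VM creation */
    struct vm_bind_queue *queues[I915_VM_BIND_MAX_QUEUE];

    static struct vm_bind_queue *
    get_bind_queue(struct i915_address_space *vm, u32 queue_idx)
    {
            if (queue_idx >= I915_VM_BIND_MAX_QUEUE)
                    return ERR_PTR(-EINVAL);

            /* created lazily on first use, under the VM lock */
            if (!vm->queues[queue_idx])
                    vm->queues[queue_idx] = bind_queue_create(vm);  /* hypothetical helper */

            return vm->queues[queue_idx];
    }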
I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation. --Jason
I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map?

For userspace it's then just a matter of selecting the right queue ID when submitting.

If there is ever a possibility of having this work done on the GPU, it would all be ready.
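To picture that suggestion: with a hypothetical I915_ENGINE_CLASS_VM_BIND class (not an existing uapi value), the bind queue would simply be another entry in the context's engine map, and userspace would pick its index at submission time:

    struct i915_engine_class_instance engines[] = {
            { .engine_class = I915_ENGINE_CLASS_RENDER,  .engine_instance = 0 },
            { .engine_class = I915_ENGINE_CLASS_COPY,    .engine_instance = 0 },
            /* hypothetical class, not an existing i915_drm.h value */
            { .engine_class = I915_ENGINE_CLASS_VM_BIND, .engine_instance = 0 },
    };

    /* passed via I915_CONTEXT_PARAM_ENGINES at context creation; a bind
     * submission would then name engine index 2, the same way execbuf
     * selects a queue by index today */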
I did sync offline with Matt Brost on this. We can add a VM_BIND engine class and let the user create VM_BIND engines (queues). The problem is, in i915 the engine creation interface is bound to the gem_context. So, in the vm_bind ioctl, we would need both a context_id and a queue_idx for proper lookup of the user-created engine. This is a bit awkward, as vm_bind is an interface to the VM (address space) and has nothing to do with the gem_context.
A gem_context has a single vm object right?
Set through I915_CONTEXT_PARAM_VM at creation or given a default one if not.
So it's just like picking up the vm like it's done at execbuffer time right now : eb->context->vm
Are you suggesting replacing 'vm_id' with 'context_id' in the VM_BIND/UNBIND ioctl and probably call it CONTEXT_BIND/UNBIND, because VM can be obtained from the context?
Yes, because if we go for engines, they're associated with a context and so also associated with the VM bound to the context.
Hmm...context doesn't sound like the right interface. It should be VM and engine (independent of context). The engine can be a virtual or soft engine (a kernel thread), each with its own queue. We can add an interface to create such engines (independent of context), but we are anyway implicitly creating one when the user uses a new queue_idx. If in the future we have hardware engines for the VM_BIND operation, we can have that explicit interface to create engine instances, and the queue_index in vm_bind/unbind will point to those engines. Does anyone have any thoughts? Daniel?
Niranjana
I think the interface is clean as an interface to the VM. It is only that we don't have a clean way to create a raw VM_BIND engine (not associated with any context) with the i915 uapi. Maybe we can add such an interface, but I don't think it is worth it (we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as I mentioned above). Does anyone have any thoughts?

Another problem is that if two VMs are binding with the same defined engine, a bind on VM1 can get unnecessarily blocked by a bind on VM2 (which may be waiting on its in_fence).
Maybe I'm missing something, but how can you have 2 vm objects with a single gem_context right now?
No, we don't have 2 VMs for a gem_context. Say we have ctx1 with vm1 and ctx2 with vm2. The first vm_bind call is for vm1 with q_idx 1 in ctx1's engine map, and the second vm_bind call is for vm2 with q_idx 2 in ctx2's engine map. If those two queue indices point to the same underlying vm_bind engine, then the second vm_bind call gets blocked until the first vm_bind call's 'in' fence is triggered and its bind completes.

With per-VM queues, this is not a problem, as two VMs will not end up sharing the same queue.
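In shorthand (these are not real ioctl wrappers, just an illustration of the per-VM scoping):

    /* with per-VM queues, q_idx 0 of vm1 and q_idx 0 of vm2 are distinct
     * queues, so vm2's bind cannot be stalled by vm1's in-fence */
    vm_bind(vm1, /* queue_idx */ 0, /* in_fence */ fence1, ...);  /* waits for fence1 */
    vm_bind(vm2, /* queue_idx */ 0, /* in_fence */ -1, ...);      /* proceeds immediately */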
BTW, I just posted an updated PATCH series: https://www.spinics.net/lists/dri-devel/msg350483.html
Niranjana
Regards, Oak
-----Original Message-----
From: Intel-gfx intel-gfx-bounces@lists.freedesktop.org On Behalf Of Niranjana Vishwanathapura
Sent: June 10, 2022 1:43 PM
To: Landwerlin, Lionel G lionel.g.landwerlin@intel.com
Cc: Intel GFX intel-gfx@lists.freedesktop.org; Maling list - DRI developers dri-devel@lists.freedesktop.org; Hellstrom, Thomas thomas.hellstrom@intel.com; Wilson, Chris P chris.p.wilson@intel.com; Vetter, Daniel daniel.vetter@intel.com; Christian König christian.koenig@amd.com
Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
Exposing gem_context or intel_context to user space is a strange concept to me. A context represents some HW resources that are used to complete a certain task. User space should only care about allocating some resources (memory, queues) and submitting tasks to queues. It doesn't care how a certain task is mapped to a HW context - the driver/GuC should take care of this.

So a cleaner interface to me is: user space creates a VM, creates a gem object and vm_binds it to the VM; allocates queues for this VM (internally these represent compute or blitter HW; a queue can be virtual to the user); and submits tasks to those queues. The user can create multiple queues under one VM, and one queue belongs to only one VM.
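In rough pseudo-uapi terms, the flow described above would be (shorthand calls for illustration, not actual i915 ioctls):

    vm = vm_create();                   /* address space */
    bo = gem_create(size);              /* backing object */
    vm_bind(vm, bo, va, size);          /* map the object into the VM */
    q  = queue_create(vm);              /* a queue belongs to exactly one VM */
    queue_submit(q, batch_va, in_fences, out_fences);
    /* i915/GuC later picks a HW engine and switches to this VM's page
     * tables when it runs work from the queue */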
The i915 driver/GuC manages the HW compute or blitter resources, which is transparent to user space. When i915 or the GuC decides to schedule a queue (run tasks on that queue), a HW engine will be picked up and set up properly for the VM of that queue (i.e., switched to the page tables of that VM) - this is a context switch.

From the vm_bind perspective, it simply binds a gem_object to a VM. The engine/queue is not a parameter to vm_bind, as any engine can be picked up by i915/GuC to execute a task using the VM-bound VA.
I didn't completely follow the discussion here. Just share some thoughts.
Regards, Oak
On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
Yah, I agree.
Lionel, how about we define the queue as a union { __u32 queue_idx; __u64 rsvd; }?
If required, we can extend by expanding the 'rsvd' field to <ctx_id, queue_idx> later with a flag.
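That is, something along these lines in the vm_bind/unbind structures (again just a sketch of the proposal):

    union {
            __u32 queue_idx;    /* per-VM bind queue index, for now */
            __u64 rsvd;         /* room to later grow into a <ctx_id, queue_idx>
                                 * pair, selected by a new flag */
    };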
Niranjana
On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
On 10/06/2022 10:54, Niranjana Vishwanathapura wrote:
On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote:
If we leave the number of vm_bind queues to be arbitrarily large (__u32 queue_idx), then we need a hashmap for queue lookup (a wq, work_item and a linked list) from the user-specified queue index. The other option is to just put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer this.

I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation.

--Jason

I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map?

For userspace it's then just a matter of selecting the right queue ID when submitting.

If there is ever a possibility to have this work on the GPU, it would be all ready.

I did sync offline with Matt Brost on this. We can add a VM_BIND engine class and let the user create VM_BIND engines (queues). The problem is that in i915 the engine creation interface is bound to gem_context. So, in the vm_bind ioctl, we would need both context_id and queue_idx for a proper lookup of the user-created engine. This is a bit awkward as vm_bind is an interface to the VM (address space) and has nothing to do with gem_context.
A gem_context has a single vm object right?
Set through I915_CONTEXT_PARAM_VM at creation or given a default one if not.
So it's just like picking up the vm the way it's done at execbuffer time right now: eb->context->vm
Are you suggesting replacing 'vm_id' with 'context_id' in the VM_BIND/UNBIND ioctl, and probably calling it CONTEXT_BIND/UNBIND, because the VM can be obtained from the context?
Yes, because if we go for engines, they're associated with a context and so also associated with the VM bound to the context.
Hmm...context doesn't sound like the right interface. It should be VM and engine (independent of context). The engine can be a virtual or soft engine (kernel thread), each with its own queue. We can add an interface to create such engines (independent of context). But we are anyway implicitly creating them when the user uses a new queue_idx. If in the future we have hardware engines for the VM_BIND operation, we can have that explicit interface to create engine instances, and the queue_index in vm_bind/unbind will point to those engines. Anyone have any thoughts? Daniel?
Exposing gem_context or intel_context to user space is a strange concept to me. A context represents some hw resources that are used to complete a certain task. User space should just allocate resources (memory, queues) and submit tasks to queues. But user space doesn't care how a certain task is mapped to a HW context - driver/guc should take care of this.

So a cleaner interface to me is: user space creates a vm, creates a gem object, vm_binds it to a vm; allocates queues (internally representing compute or blitter HW; queues can be virtual to the user) for this vm; submits tasks to queues. The user can create multiple queues under one vm. One queue is only for one vm.

The i915 driver/guc manages the hw compute or blitter resources, which is transparent to user space. When i915 or guc decides to schedule a queue (run tasks on that queue), a HW engine will be picked up and set up properly for the vm of that queue (i.e., switch to the page tables of that vm) - this is a context switch.

From the vm_bind perspective, it simply binds a gem_object to a vm. Engine/queue is not a parameter to vm_bind, as any engine can be picked up by i915/guc to execute a task using the vm-bound va.
I didn't completely follow the discussion here. Just share some thoughts.
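Roughly, that flow from userspace might look like the sketch below (using libdrm's drmIoctl). Only the VM create and GEM create calls are existing uapi; the vm_bind ioctl, its struct layout and the queue_idx field are the proposed/assumed pieces here.

    /* Sketch only: create a vm, create an object, bind it, then submit work. */
    struct drm_i915_gem_vm_control vm = {};
    drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm);         /* existing uapi */

    struct drm_i915_gem_create create = { .size = 2 * 1024 * 1024 };
    drmIoctl(fd, DRM_IOCTL_I915_GEM_CREATE, &create);        /* existing uapi */

    struct drm_i915_gem_vm_bind bind = {                     /* proposed uapi (sketch) */
            .vm_id     = vm.vm_id,
            .handle    = create.handle,
            .start     = 0x1000000,
            .offset    = 0,
            .length    = create.size,
            .queue_idx = 0,                                  /* hypothetical per-VM queue */
    };
    drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);         /* proposed ioctl */

    /* ...then submit batches that use the bound VA; any engine picked by
     * i915/GuC can use it, since the binding belongs to the VM, not an engine. */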
Yah, I agree.
Lionel, How about we define the queue as union { __u32 queue_idx; __u64 rsvd; }
If required, we can extend by expanding the 'rsvd' field to <ctx_id, queue_idx> later with a flag.
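As a sketch, that would mean replacing the plain queue_idx member in the vm_bind/unbind structs above with the snippet below; the flag name is made up purely for illustration and this fragment is not standalone code.

    #define I915_VM_BIND_QUEUE_IN_CTX   (1 << 0)   /* hypothetical future flag */

            union {
                    __u32 queue_idx;   /* default: per-VM bind queue index */
                    __u64 rsvd;        /* with a flag like the one above, this
                                        * could later carry a <ctx_id, queue_idx>
                                        * pair instead */
            };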
Niranjana
I did not really understand Oak's comment nor what you're suggesting here to be honest.
First the GEM context is already exposed to userspace. It's explicitly created by userspace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
We give the GEM context id in every execbuffer we do with drm_i915_gem_execbuffer2::rsvd1.
It's still in the new execbuffer3 proposal being discussed.
Second, the GEM context is also where we set the VM with I915_CONTEXT_PARAM_VM.
Third, the GEM context also has the list of engines with I915_CONTEXT_PARAM_ENGINES.
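For reference, those three pieces fit together with the existing uapi roughly as below. This is a simplified sketch with error handling omitted; in practice the VM and the engine map are usually supplied at context creation time via the create extensions rather than SETPARAM.

    /* Simplified sketch of the existing uapi: VM + engines hang off a GEM context. */
    struct drm_i915_gem_vm_control vm = {};
    drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm);

    struct drm_i915_gem_context_create ctx = {};
    drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &ctx);

    struct drm_i915_gem_context_param p = {
            .ctx_id = ctx.ctx_id,
            .param  = I915_CONTEXT_PARAM_VM,
            .value  = vm.vm_id,
    };
    drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);

    I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
            .engines = { { .engine_class = I915_ENGINE_CLASS_RENDER } },
    };
    p.param = I915_CONTEXT_PARAM_ENGINES;
    p.size  = sizeof(engines);
    p.value = (uintptr_t)&engines;
    drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);

    /* ctx.ctx_id then goes into drm_i915_gem_execbuffer2::rsvd1 at submit time. */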
So it makes sense to me to dispatch the vm_bind operation to a GEM context, to a given vm_bind queue, because it's got all the information required :
- the list of new vm_bind queues
- the vm that is going to be modified
Otherwise where do the vm_bind queues live?
In the i915/drm fd object?
That would mean that all the GEM contexts are sharing the same vm_bind queues.
intel_context or GuC are internal details we're not concerned about.
I don't really see the connection with the GEM context.
Maybe Oak has a different use case than Vulkan.
-Lionel
Regards, Oak
Niranjana
I think the interface is clean as an interface to VM. It is only that we don't have a clean way to create a raw VM_BIND engine (not associated with any context) with the i915 uapi. Maybe we can add such an interface, but I don't think that is worth it (we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as I mentioned above). Anyone have any thoughts?
Another problem is, if two VMs are binding with the same defined engine, binding on VM1 can get unnecessarily blocked by binding on VM2 (which may be waiting on its in_fence).
Maybe I'm missing something, but how can you have 2 vm objects with a single gem_context right now?
No, we don't have 2 VMs for a gem_context. Say ctx1 is with vm1 and ctx2 with vm2. The first vm_bind call was for vm1 with q_idx 1 in the ctx1 engine map. The second vm_bind call was for vm2 with q_idx 2 in the ctx2 engine map. If those two queue indices point to the same underlying vm_bind engine, then the second vm_bind call gets blocked until the first vm_bind call's 'in' fence is triggered and the bind completes.

With per-VM queues, this is not a problem as two VMs will not end up sharing the same queue.
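With the sketched per-VM queue field from above (still a hypothetical ioctl and layout), that scenario looks like this: the two binds land on queues owned by different VMs and cannot stall each other, even with the same queue index.

    /* Hypothetical sketch: vm1_id and vm2_id are two previously created VMs. */
    struct drm_i915_gem_vm_bind bind1 = {
            .vm_id = vm1_id, .queue_idx = 0,
            /* in-fence extension not yet signaled: only vm1's queue 0 waits */
    };
    struct drm_i915_gem_vm_bind bind2 = {
            .vm_id = vm2_id, .queue_idx = 0,
            /* same index but a different VM, hence a different queue: not blocked */
    };
    drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind1);
    drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind2);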
BTW, I just posted an updated PATCH series. https://www.spinics.net/lists/dri-devel/msg350483.html
Niranjana
On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
I did not really understand Oak's comment nor what you're suggesting here to be honest.
First the GEM context is already exposed to userspace. It's explicitly created by userspace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
We give the GEM context id in every execbuffer we do with drm_i915_gem_execbuffer2::rsvd1.
It's still in the new execbuffer3 proposal being discussed.
Second, the GEM context is also where we set the VM with I915_CONTEXT_PARAM_VM.
Third, the GEM context also has the list of engines with I915_CONTEXT_PARAM_ENGINES.
Yes, the execbuf and engine map creation are tied to gem_context. (which probably is not the best interface.)
So it makes sense to me to dispatch the vm_bind operation to a GEM context, to a given vm_bind queue, because it's got all the information required :
- the list of new vm_bind queues
- the vm that is going to be modified
But the operation is performed here on the address space (VM) which can have multiple gem_contexts referring to it. So, VM is the right interface here. We need not 'gem_context'ify it.
All we need is multiple queue support for the address space (VM). Going to gem_context for that just because we have engine creation support there seems unnecessary and not correct to me.
Otherwise where do the vm_bind queues live?
In the i915/drm fd object?
That would mean that all the GEM contexts are sharing the same vm_bind queues.
Not all, only the gem contexts that are using the same address space (VM). But to me the right way to describe it would be that "the VM will be using those queues".
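One way to picture "the VM using those queues" on the kernel side is a small per-VM array of bind queues created lazily on first use of a queue index. The sketch below is purely illustrative (made-up type and function names), not the actual i915 implementation.

    /* Illustrative sketch only: per-VM bind queues, each backed by an ordered
     * workqueue so binds on one queue stay in submission order while other
     * queues (and other VMs) proceed independently. */
    #define VM_BIND_MAX_QUEUE   8   /* e.g. the n=3 -> 8 queues mentioned earlier */

    struct sketch_vm_bind_queue {
            struct workqueue_struct *wq;   /* ordered: one bind/unbind at a time */
    };

    struct sketch_address_space {
            struct mutex queue_lock;
            struct sketch_vm_bind_queue queues[VM_BIND_MAX_QUEUE];
    };

    static struct sketch_vm_bind_queue *
    sketch_vm_get_queue(struct sketch_address_space *vm, u32 queue_idx)
    {
            struct sketch_vm_bind_queue *q;

            if (queue_idx >= VM_BIND_MAX_QUEUE)
                    return ERR_PTR(-EINVAL);

            q = &vm->queues[queue_idx];
            mutex_lock(&vm->queue_lock);
            if (!q->wq)   /* created implicitly on first use of this index */
                    q->wq = alloc_ordered_workqueue("vm_bind_q%u", 0, queue_idx);
            mutex_unlock(&vm->queue_lock);

            return q->wq ? q : ERR_PTR(-ENOMEM);
    }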
Niranjana
Thanks, Oak
-----Original Message----- From: Vishwanathapura, Niranjana niranjana.vishwanathapura@intel.com Sent: June 14, 2022 1:02 PM To: Landwerlin, Lionel G lionel.g.landwerlin@intel.com Cc: Zeng, Oak oak.zeng@intel.com; Intel GFX <intel- gfx@lists.freedesktop.org>; Maling list - DRI developers <dri- devel@lists.freedesktop.org>; Hellstrom, Thomas thomas.hellstrom@intel.com; Wilson, Chris P chris.p.wilson@intel.com; Vetter, Daniel daniel.vetter@intel.com; Christian König christian.koenig@amd.com Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
On Tue, Jun 14, 2022 at 10:04:00AM +0300, Lionel Landwerlin wrote:
On 13/06/2022 21:02, Niranjana Vishwanathapura wrote:
On Mon, Jun 13, 2022 at 06:33:07AM -0700, Zeng, Oak wrote:
Regards, Oak
-----Original Message----- From: Intel-gfx intel-gfx-bounces@lists.freedesktop.org On Behalf Of Niranjana Vishwanathapura Sent: June 10, 2022 1:43 PM To: Landwerlin, Lionel G lionel.g.landwerlin@intel.com Cc: Intel GFX intel-gfx@lists.freedesktop.org; Maling list - DRI developers <dri- devel@lists.freedesktop.org>; Hellstrom, Thomas thomas.hellstrom@intel.com; Wilson, Chris P chris.p.wilson@intel.com; Vetter, Daniel daniel.vetter@intel.com; Christian König
Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
On Fri, Jun 10, 2022 at 11:18:14AM +0300, Lionel Landwerlin wrote:
On 10/06/2022 10:54, Niranjana Vishwanathapura wrote: >On Fri, Jun 10, 2022 at 09:53:24AM +0300, Lionel Landwerlin wrote: >>On 09/06/2022 22:31, Niranjana Vishwanathapura wrote: >>>On Thu, Jun 09, 2022 at 05:49:09PM +0300, Lionel Landwerlin wrote: >>>> On 09/06/2022 00:55, Jason Ekstrand wrote: >>>> >>>> On Wed, Jun 8, 2022 at 4:44 PM Niranjana Vishwanathapura >>>> niranjana.vishwanathapura@intel.com wrote: >>>> >>>> On Wed, Jun 08, 2022 at 08:33:25AM +0100, Tvrtko
Ursulin wrote:
>>>> > >>>> > >>>> >On 07/06/2022 22:32, Niranjana Vishwanathapura wrote: >>>> >>On Tue, Jun 07, 2022 at 11:18:11AM -0700, Niranjana >>>>Vishwanathapura >>>> wrote: >>>> >>>On Tue, Jun 07, 2022 at 12:12:03PM -0500, Jason >>>>Ekstrand wrote: >>>> >>>> On Fri, Jun 3, 2022 at 6:52 PM Niranjana
Vishwanathapura
>>>> >>>> niranjana.vishwanathapura@intel.com wrote: >>>> >>>> >>>> >>>> On Fri, Jun 03, 2022 at 10:20:25AM +0300, Lionel >>>>Landwerlin >>>> wrote: >>>> >>>> > On 02/06/2022 23:35, Jason Ekstrand wrote: >>>> >>>> > >>>> >>>> > On Thu, Jun 2, 2022 at 3:11 PM Niranjana >>>>Vishwanathapura >>>> >>>> > niranjana.vishwanathapura@intel.com wrote: >>>> >>>> > >>>> >>>> > On Wed, Jun 01, 2022 at 01:28:36PM
-0700, Matthew
>>>> >>>>Brost wrote: >>>> >>>> > >On Wed, Jun 01, 2022 at 05:25:49PM
+0300, Lionel
On the design document's statement that the VM_BIND/UNBIND ioctl "will immediately start binding/unbinding the mapping in an async worker":

Lionel Landwerlin:

I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved. And the fact that it's happening in an async worker seems to imply it's not immediate.

I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do :

VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)

In what order are the fences going to be signaled? In the order of VM_BIND ioctls? Or out of order? Because you wrote "serialized" I assume it's : in order.

One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification. In Vulkan, VM_BIND operations are serialized but per engine. So you could have something like this :

VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)

fence1 is not signaled, fence3 is signaled, so the second VM_BIND will proceed before the first VM_BIND. I guess we can deal with that scenario in userspace by doing the wait ourselves in one thread per engine, but then it makes the VM_BIND input fences useless. Daniel : what do you think? Should we rework this or just deal with wait fences in userspace?

Niranjana Vishwanathapura:

Ok, will fix the "immediately" wording. It was added because in the earlier design binding was deferred until the next execbuff; now it is non-deferred (immediate in that sense). But yah, this is confusing and I will fix it.

On ordering: yes, the fences are signaled in the order of the VM_BIND/UNBIND ioctls. Note that bind and unbind will use the same queue and hence are ordered.

Matthew Brost:

My opinion is rework this, but make the ordering via an engine param optional. E.g. a VM can be configured so all binds are ordered within the VM, or a VM can be configured so all binds accept an engine argument (in the case of the i915, likely a gem context handle) and binds are ordered with respect to that engine. This gives UMDs options: the latter likely consumes more KMD resources, so a UMD that can live with binds being ordered within the VM can use the mode consuming fewer resources.

Niranjana Vishwanathapura:

I think we need to be careful here if we are looking for some out-of-(submission)-order completion of vm_bind/unbind. In-order completion means that, in a batch of binds and unbinds to be completed in order, the user only needs to specify an in-fence for the first bind/unbind call and the out-fence for the last one. Also, the VA released by an unbind call can be re-used by any subsequent bind call in that in-order batch.

These things will break if binding/unbinding were allowed to go out of (submission) order, and the user would need to be extra careful not to run into premature triggering of the out-fence, binds failing because the VA is still in use, etc.

Also, VM_BIND binds the provided mapping on the specified address space (VM), so the uapi is not engine/context specific. We can however add a 'queue' to the uapi which can be one of the pre-defined queues:

I915_VM_BIND_QUEUE_0
I915_VM_BIND_QUEUE_1
...
I915_VM_BIND_QUEUE_(N-1)

The KMD will spawn an async work queue for each queue, which will only bind the mappings on that queue in the order of submission. The user can assign a queue per engine or anything like that. But again, the user needs to be careful not to deadlock these queues with a circular dependency of fences. I prefer adding this later as an extension, based on whether it really helps the implementation.

Jason Ekstrand:

I can tell you right now that having everything on a single in-order queue will not get us the perf we want. What Vulkan really wants is one of two things:

1. No implicit ordering of VM_BIND ops. They just happen in whatever order their dependencies are resolved and we ensure ordering ourselves by having a syncobj in the VkQueue.
2. The ability to create multiple VM_BIND queues. We need at least 2, but I don't see why there needs to be a limit besides the limits the i915 API already has on the number of engines. Vulkan could expose multiple sparse binding queues to the client if it's not arbitrarily limited.

Niranjana Vishwanathapura:

Thanks Jason, Lionel. Jason, what are you referring to when you say "limits the i915 API already has on the number of engines"? I am not sure if there is such an uapi today.

Jason Ekstrand:

There's a limit of something like 64 total engines today based on the number of bits we can cram into the exec flags in execbuffer2. I think someone had an extended version that allowed more but I ripped it out because no one was using it. Of course, execbuffer3 might not have that problem at all.

Niranjana Vishwanathapura:

Thanks Jason. Ok, I am not sure which exec flag that is, but yah, execbuffer3 probably will not have this limitation. So we need to define a VM_BIND_MAX_QUEUE and somehow export it to the user (I am thinking of embedding it in I915_PARAM_HAS_VM_BIND: bits[0] -> HAS_VM_BIND, bits[1-3] -> 'n' meaning 2^n queues).

Ah, I think you are talking about I915_EXEC_RING_MASK (0x3f), which execbuf3 will also have. So we can simply define in the vm_bind/unbind structures:

#define I915_VM_BIND_MAX_QUEUE 64
__u32 queue;

I think that will keep things simple.

Jason Ekstrand:

Yup! That's exactly the limit I was talking about.

Tvrtko Ursulin:

Hmmm? What does the execbuf2 limit have to do with how many engines the hardware can have? I suggest not doing that. The change which added this to context creation:

	if (set.num_engines > I915_EXEC_RING_MASK + 1)
		return -EINVAL;

needs to be undone, so let users create engine maps with all hardware engines, and let execbuf3 access them all.

Niranjana Vishwanathapura:

The earlier plan was to carry I915_EXEC_RING_MASK (0x3f) over to execbuff3 as well, hence I was using the same limit for VM_BIND queues (64, or 65 if we make it N+1). But, as discussed in another thread of this RFC series, we are planning to drop I915_EXEC_RING_MASK in execbuff3. So there won't be any uapi that limits the number of engines (and hence the number of vm_bind queues that need to be supported).

If we leave the number of vm_bind queues arbitrarily large (__u32 queue_idx), then we need a hashmap for queue lookup (a wq, work_item and a linked list) from the user-specified queue index. The other option is to just put some hard limit (say 64 or 65) and use an array of queues in the VM (each created upon first use). I prefer this.

Jason Ekstrand:

I don't get why a VM_BIND queue is any different from any other queue or userspace-visible kernel object. But I'll leave those details up to danvet or whoever else might be reviewing the implementation.

--Jason

Lionel Landwerlin:

I kind of agree here. Wouldn't it be simpler to have the bind queue created like the others when we build the engine map? For userspace it's then just a matter of selecting the right queue ID when submitting. If there is ever a possibility to have this work on the GPU, it would be all ready.

Niranjana Vishwanathapura:

I did sync offline with Matt Brost on this. We can add a VM_BIND engine class and let the user create VM_BIND engines (queues). The problem is that in i915 the engine creation interface is bound to gem_context. So, in the vm_bind ioctl, we would need both a context_id and a queue_idx for proper lookup of the user-created engine. This is a bit awkward as vm_bind is an interface to the VM (address space) and has nothing to do with gem_context.

Lionel Landwerlin:

A gem_context has a single vm object, right? Set through I915_CONTEXT_PARAM_VM at creation, or given a default one if not. So it's just like picking up the vm like it's done at execbuffer time right now : eb->context->vm.

Niranjana Vishwanathapura:

Are you suggesting replacing 'vm_id' with 'context_id' in the VM_BIND/UNBIND ioctl, and probably calling it CONTEXT_BIND/UNBIND, because the VM can be obtained from the context?
Yes, because if we go for engines, they're associated with a context and so also associated with the VM bound to the context.
Hmm...context doesn't sound like the right interface. It should be VM and engine (independent of context). The engine can be a virtual or soft engine (a kernel thread), each with its own queue. We can add an interface to create such engines (independent of context), but we are anyway implicitly creating one when the user uses a new queue_idx. If in the future we have hardware engines for the VM_BIND operation, we can have that explicit interface to create engine instances, and the queue_index in vm_bind/unbind will point to those engines. Anyone have any thoughts? Daniel?
Exposing gem_context or intel_context to user space is a strange concept to me. A context represents some HW resources that are used to complete a certain task. User space should care about allocating some resources (memory, queues) and submitting tasks to queues. But user space doesn't care how a certain task is mapped to a HW context - the driver/GuC should take care of this.

So a cleaner interface to me is: user space creates a VM, creates a GEM object and vm_binds it to the VM; allocates queues (internally representing compute or blitter HW; a queue can be virtual to the user) for this VM; and submits tasks to the queues. A user can create multiple queues under one VM. One queue is only for one VM.

The i915 driver/GuC manages the HW compute or blitter resources, which is transparent to user space. When i915 or the GuC decides to schedule a queue (run tasks on that queue), a HW engine will be picked and set up properly for the VM of that queue (i.e., switched to the page tables of that VM) - this is a context switch.

From the vm_bind perspective, it simply binds a gem_object to a VM. Engine/queue is not a parameter to vm_bind, as any engine can be picked by i915/GuC to execute a task using the VM-bound VA.
I didn't completely follow the discussion here. Just share some thoughts.
Yah, I agree.
Lionel, How about we define the queue as union { __u32 queue_idx; __u64 rsvd; }
If required, we can extend by expanding the 'rsvd' field to <ctx_id, queue_idx> later with a flag.
Niranjana
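(Purely as an illustrative sketch of the proposal above -- not the actual i915 uapi; every field other than the queue union is an assumed placeholder for discussion:)

    /* Illustrative sketch only; not the actual i915 uapi. The surrounding
     * fields are assumptions -- the point is just where the proposed union
     * would live: a per-VM bind queue index today, with the reserved space
     * available to grow into a <ctx_id, queue_idx> pair later behind a flag. */
    #include <stdint.h>

    struct sketch_vm_bind {
            uint32_t vm_id;         /* address space (VM) the mapping applies to */
            uint32_t handle;        /* GEM object handle */
            uint64_t start;         /* GPU virtual address of the mapping */
            uint64_t offset;        /* offset into the object */
            uint64_t length;        /* length of the mapping */
            union {
                    uint32_t queue_idx; /* per-VM bind queue, as proposed */
                    uint64_t rsvd;      /* room for <ctx_id, queue_idx> + a flag later */
            };
            uint64_t flags;
            uint64_t extensions;
    };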
I did not really understand Oak's comment nor what you're suggesting here to be honest.
First, the GEM context is already exposed to userspace. It's explicitly created by userspace with DRM_IOCTL_I915_GEM_CONTEXT_CREATE.
We give the GEM context id in every execbuffer we do with drm_i915_gem_execbuffer2::rsvd1.
It's still in the new execbuffer3 proposal being discussed.
Second, the GEM context is also where we set the VM with I915_CONTEXT_PARAM_VM.
Third, the GEM context also has the list of engines with I915_CONTEXT_PARAM_ENGINES.
Yes, the execbuf and engine map creation are tied to gem_context. (which probably is not the best interface.)
So it makes sense to me to dispatch the vm_bind operation to a GEM context, to a given vm_bind queue, because it's got all the information required :
- the list of new vm_bind queues
- the vm that is going to be modified
But the operation is performed here on the address space (VM) which can have multiple gem_contexts referring to it. So, VM is the right interface here. We need not 'gem_context'ify it.
All we need is multiple queue support for the address space (VM). Going to gem_context for that just because we have engine creation support there seems unnecessary and not correct to me.
Otherwise where do the vm_bind queues live?
In the i915/drm fd object?
That would mean that all the GEM contexts are sharing the same vm_bind queues.
Not all, only the gem contexts that are using the same address space (VM). But to me the right way to describe it would be that "VM will be using those queues".
I hope by "queue" here you mean a HW resource that will be later used to execute the job, for example a ccs compute engine. Of course queue can be virtual so user can create more queues than what hw physically has.
To express the concept of "VM will be using those queues", I think it make sense to have create_queue(vm) function taking a vm parameter. This means this queue is created for the purpose of submit job under this VM. Later on, we can submit job (referring to objects vm_bound to the same vm) to the queue. The vm_bind ioctl doesn’t need to have queue parameter, just vm_bind (object, va, vm).
I hope the "queue" here is not the engine used to perform the vm_bind operation itself. But if you meant a queue/engine to perform vm_bind itself (vs a queue/engine for later job submission), then we can discuss more. I know xe driver have similar concept and I think align the design early can benefit the migration to xe driver.
Regards, Oak
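(For discussion, here is a rough sketch of the create_queue(vm)-style flow described above. None of these functions exist in the i915 uapi; the names and signatures are hypothetical stubs used only to show the ordering of the calls:)

    /* Hypothetical flow only: create a VM, bind objects into it with no
     * queue parameter, create queues against that VM, then submit to them.
     * The stub bodies just print what a driver would be asked to do. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t vm_handle_t;
    typedef uint32_t bo_handle_t;
    typedef uint32_t queue_handle_t;

    static uint32_t next_id = 1;

    static vm_handle_t create_vm(void) { return next_id++; }
    static bo_handle_t create_bo(uint64_t size) { (void)size; return next_id++; }

    static void vm_bind(bo_handle_t bo, uint64_t va, vm_handle_t vm)
    {
            printf("bind bo %u at va 0x%llx in vm %u\n",
                   (unsigned)bo, (unsigned long long)va, (unsigned)vm);
    }

    static queue_handle_t create_queue(vm_handle_t vm)
    {
            printf("create queue for vm %u\n", (unsigned)vm);
            return next_id++;
    }

    static void submit(queue_handle_t q, const char *job)
    {
            printf("submit %s to queue %u\n", job, (unsigned)q);
    }

    int main(void)
    {
            vm_handle_t vm = create_vm();
            bo_handle_t bo = create_bo(4096);

            vm_bind(bo, 0x100000, vm);            /* vm_bind(object, va, vm): no queue argument */

            queue_handle_t q0 = create_queue(vm); /* queues are created against a VM */
            queue_handle_t q1 = create_queue(vm); /* one VM can have several queues */

            submit(q0, "compute job");
            submit(q1, "blit job");
            return 0;
    }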
Niranjana
intel_context or GuC are internal details we're not concerned about.
I don't really see the connection with the GEM context.
Maybe Oak has a different use case than Vulkan.
-Lionel
Regards, Oak
Niranjana
Niranjana Vishwanathapura:

I think the interface is clean as an interface to the VM. It is only that we don't have a clean way to create a raw VM_BIND engine (not associated with any context) with the i915 uapi. Maybe we can add such an interface, but I don't think that is worth it (we might as well just use a queue_idx in the VM_BIND/UNBIND ioctl as I mentioned above). Anyone has any thoughts?

Another problem is that if two VMs are binding with the same defined engine, binding on VM1 can get unnecessarily blocked by binding on VM2 (which may be waiting on its in_fence).

Lionel Landwerlin:

Maybe I'm missing something, but how can you have 2 vm objects with a single gem_context right now?

Niranjana Vishwanathapura:

No, we don't have 2 VMs for a gem_context. Say ctx1 is with vm1 and ctx2 with vm2. The first vm_bind call was for vm1 with q_idx 1 in the ctx1 engine map; the second vm_bind call was for vm2 with q_idx 2 in the ctx2 engine map. If those two queue indices point to the same underlying vm_bind engine, then the second vm_bind call gets blocked until the first vm_bind call's 'in' fence is triggered and the bind completes. With per-VM queues, this is not a problem, as two VMs will not end up sharing the same queue.

So, my preference here is to just add a 'u32 queue' index in the vm_bind/unbind ioctl, and the queues are per VM.

BTW, I just posted an updated PATCH series:
https://www.spinics.net/lists/dri-devel/msg350483.html

I am trying to see how many queues we need, and don't want it to be arbitrarily large and unduly blow up memory usage and complexity in the i915 driver.

Jason Ekstrand:

I expect a Vulkan driver to use at most 2 in the vast majority of cases. I could imagine a client wanting to create more than 1 sparse queue, in which case it'll be N+1, but that's unlikely. As far as complexity goes, once you allow two, I don't think the complexity goes up by allowing N. As for memory usage, creating more queues means more memory. That's a trade-off that userspace can make. Again, the expected number here is 1 or 2 in the vast majority of cases, so I don't think you need to worry.

Niranjana Vishwanathapura:

Ok, will start with n=3, meaning 8 queues. That would require us to create 8 workqueues. We can change 'n' later if required.

Jason Ekstrand:

Why? Because Vulkan has two basic kinds of bind operations and we don't want any dependencies between them:

1. Immediate. These happen right after BO creation or maybe as part of vkBindImageMemory() or vkBindBufferMemory(). These don't happen on a queue and we don't want them serialized with anything. To synchronize with submit, we'll have a syncobj in the VkDevice which is signaled by all immediate bind operations and make submits wait on it.
2. Queued (sparse): These happen on a VkQueue which may be the same as a render/compute queue or may be its own queue. It's up to us what we want to advertise. From the Vulkan API PoV, this is like any other queue. Operations on it wait on and signal semaphores. If we have a VM_BIND engine, we'd provide syncobjs to wait on and signal just like we do in execbuf().

The important thing is that we don't want one type of operation to block on the other. If immediate binds are blocking on sparse binds, it's going to cause over-synchronization issues.

In terms of the internal implementation, I know that there's going to be a lock on the VM and that we can't actually do these things in parallel. That's fine. Once the dma_fences have signaled and we're unblocked to do the bind operation, I don't care if there's a bit of synchronization due to locking. That's expected. What we can't afford to have is an immediate bind operation suddenly blocking on a sparse operation which is blocked on a compute job that's going to run for another 5ms.

Niranjana Vishwanathapura:

That's correct. It is like a single VM_BIND engine with multiple queues feeding into it. And as the VM_BIND queue is per VM, VM_BIND on one VM doesn't block the VM_BIND on other VMs. I am not sure about the use cases here, but just wanted to clarify.

Jason Ekstrand:

Right. As long as the queues themselves are independent and can block on dma_fences without holding up other queues, I think we're fine. Yes, that's what I would expect.

For reference, Windows solves this by allowing arbitrarily many paging queues (what they call a VM_BIND engine/queue). That design works pretty well and solves the problems in question. Again, we could just make everything out-of-order and require using syncobjs to order things as userspace wants. That'd be fine too.

One more note while I'm here: danvet said something on IRC about VM_BIND queues waiting for syncobjs to materialize. We don't really want/need this. We already have all the machinery in userspace to handle wait-before-signal and waiting for syncobj fences to materialize, and that machinery is on by default. It would actually take MORE work in Mesa to turn it off and take advantage of the kernel being able to wait for syncobjs to materialize. Also, getting that right is ridiculously hard and I really don't want to get it wrong in kernel space. When we do memory fences, wait-before-signal will be a thing. We don't need to try and make it a thing for syncobj.

--Jason

Lionel Landwerlin:

Thanks Jason. I missed the bit in the Vulkan spec that we're allowed to have a sparse queue that does not implement either graphics or compute operations:

"While some implementations may include VK_QUEUE_SPARSE_BINDING_BIT support in queue families that also include graphics and compute support, other implementations may only expose a VK_QUEUE_SPARSE_BINDING_BIT-only queue family."

So it can all be a vm_bind engine that just does bind/unbind operations. But yes, we need another engine for the immediate/non-sparse operations.

-Lionel

Niranjana Vishwanathapura:

Daniel, any thoughts?
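(On the I915_PARAM_HAS_VM_BIND encoding floated earlier in this thread -- bits[0] indicating support and bits[1-3] carrying 'n' for 2^n bind queues -- here is a sketch of how userspace could decode such a value. The encoding was only a proposal in the discussion, so treat the bit layout as an assumption:)

    /* Decode the proposed (not settled) parameter layout:
     * bits[0]   -> VM_BIND supported
     * bits[1-3] -> n, meaning 2^n bind queues per VM
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static bool has_vm_bind(uint64_t param)
    {
            return param & 0x1;
    }

    static unsigned int vm_bind_queue_count(uint64_t param)
    {
            unsigned int n = (param >> 1) & 0x7;
            return 1u << n;
    }

    int main(void)
    {
            uint64_t param = 0x7; /* example: supported, n = 3 -> 8 queues */

            if (has_vm_bind(param))
                    printf("VM_BIND supported, %u bind queues\n",
                           vm_bind_queue_count(param));
            return 0;
    }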
Thanks, Oak
-----Original Message-----
From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of Zeng, Oak
Sent: June 14, 2022 5:13 PM
To: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>; Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
Cc: Intel GFX <intel-gfx@lists.freedesktop.org>; Wilson, Chris P <chris.p.wilson@intel.com>; Hellstrom, Thomas <thomas.hellstrom@intel.com>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian König <christian.koenig@amd.com>
Subject: RE: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document

Thanks, Oak

-----Original Message-----
From: Vishwanathapura, Niranjana <niranjana.vishwanathapura@intel.com>
Sent: June 14, 2022 1:02 PM
To: Landwerlin, Lionel G <lionel.g.landwerlin@intel.com>
Cc: Zeng, Oak <oak.zeng@intel.com>; Intel GFX <intel-gfx@lists.freedesktop.org>; Maling list - DRI developers <dri-devel@lists.freedesktop.org>; Hellstrom, Thomas <thomas.hellstrom@intel.com>; Wilson, Chris P <chris.p.wilson@intel.com>; Vetter, Daniel <daniel.vetter@intel.com>; Christian König <christian.koenig@amd.com>
Subject: Re: [Intel-gfx] [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
Oops, I read more of this thread and it turns out the vm_bind queue here is actually used to perform the vm bind/unbind operations themselves. The xe driver has a similar concept (except it is called engine_id there), so having a queue_idx parameter is closer to the xe design.

That said, I still feel having a queue_idx parameter to vm_bind is a bit awkward. vm_bind can be performed without any GPU engines, i.e., the CPU itself can complete a vm bind as long as the CPU has access to the GPU's local memory. So the queue here has to be a virtual concept - it doesn't have a hard mapping to a GPU blitter engine.

Can someone summarize what the benefit of the queue_idx parameter is? For the purpose of ordering vm_bind against later GPU jobs?
Regards, Oak
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
And the fact that it's happening in an async worker seem to imply it's not immediate.
I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do :
VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)
In what order are the fences going to be signaled?
In the order of VM_BIND ioctls? Or out of order?
Because you wrote "serialized" I assume it's : in order.
One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification.
In Vulkan VM_BIND operations are serialized but per engine.
So you could have something like this :
VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
Question - let's say this is done after the above operations:
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
Is the exec ordered with respect to bind (i.e. would fence3 & 4 be signaled before the exec starts)?
Matt
fence1 is not signaled
fence3 is signaled
So the second VM_BIND will proceed before the first VM_BIND.
I guess we can deal with that scenario in userspace by doing the wait ourselves in one thread per engines.
But then it makes the VM_BIND input fences useless.
Daniel : what do you think? Should we rework this or just deal with wait fences in userspace?
Sorry I noticed this late.
-Lionel
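(To make the disconnect above concrete, here is a toy model -- not driver code -- of per-engine bind queues where each engine's binds only wait on their own in-fences, so the ccs0 bind can complete while the rcs0 bind is still waiting:)

    /* Toy model (not driver code) of the per-engine serialization described
     * above: each engine has its own in-order bind queue, and a bind on ccs0
     * does not wait for an earlier bind on rcs0. Fence handling is reduced
     * to booleans purely for illustration. */
    #include <stdbool.h>
    #include <stdio.h>

    struct bind { const char *name; bool *in_fence; bool *out_fence; };

    /* Process one engine's queue in submission order and stop at the first
     * bind whose in-fence has not signaled yet. */
    static void process_engine(struct bind *q, int n)
    {
            for (int i = 0; i < n; i++) {
                    if (q[i].in_fence && !*q[i].in_fence) {
                            printf("%s: waiting on its in-fence\n", q[i].name);
                            return; /* later binds on this engine stay queued */
                    }
                    *q[i].out_fence = true;
                    printf("%s: bound, out-fence signaled\n", q[i].name);
            }
    }

    int main(void)
    {
            bool fence1 = false, fence2 = false; /* rcs0 bind: blocked */
            bool fence3 = true,  fence4 = false; /* ccs0 bind: ready  */

            struct bind rcs0[] = { { "VM_BIND(rcs0)", &fence1, &fence2 } };
            struct bind ccs0[] = { { "VM_BIND(ccs0)", &fence3, &fence4 } };

            process_engine(rcs0, 1); /* stays pending: fence1 not signaled */
            process_engine(ccs0, 1); /* proceeds: fence4 signals before fence2 */

            printf("fence2=%d fence4=%d\n", fence2, fence4);
            return 0;
    }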
On 02/06/2022 00:18, Matthew Brost wrote:
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
And the fact that it's happening in an async worker seem to imply it's not immediate.
I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do :
VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)
In what order are the fences going to be signaled?
In the order of VM_BIND ioctls? Or out of order?
Because you wrote "serialized" I assume it's : in order.
One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification.
In Vulkan VM_BIND operations are serialized but per engine.
So you could have something like this :
VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
Question - let's say this is done after the above operations:
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
Is the exec ordered with respect to bind (i.e. would fence3 & 4 be signaled before the exec starts)?
Matt
Hi Matt,
From the vulkan point of view, everything is serialized within an engine (we map that to a VkQueue).
So with :
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL) VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC completes first then VM_BIND executes.
To be even clearer :
EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL) VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC will wait until fence2 is signaled. Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
It would be kind of like having the VM_BIND operation be another batch executed from the ring buffer.
-Lionel
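(A similarly reduced toy model -- again not driver code -- of a single per-engine in-order queue, matching the EXEC-then-VM_BIND ordering described above: the bind cannot run before the EXEC ahead of it, and its out-fence only signals once the bind completes:)

    /* Toy model (not driver code): one in-order queue per engine. Each
     * operation waits for its optional in-fence, runs, then signals its
     * out-fence, so operations on the same engine complete in submission
     * order. */
    #include <stdbool.h>
    #include <stdio.h>

    struct fence { bool signaled; };

    struct op {
            const char   *name;     /* e.g. "EXEC" or "VM_BIND" */
            struct fence *in;       /* optional in-fence (may be NULL) */
            struct fence *out;      /* optional out-fence (may be NULL) */
    };

    /* Run one engine's queue strictly in order. */
    static void run_queue(struct op *ops, int count)
    {
            for (int i = 0; i < count; i++) {
                    if (ops[i].in && !ops[i].in->signaled) {
                            printf("%s waits for its in-fence\n", ops[i].name);
                            /* In a real driver this wait is asynchronous; here
                             * we simply pretend the fence signals now. */
                            ops[i].in->signaled = true;
                    }
                    printf("%s runs\n", ops[i].name);
                    if (ops[i].out)
                            ops[i].out->signaled = true;
            }
    }

    int main(void)
    {
            struct fence fence2 = { false }, fence3 = { false }, fence4 = { false };

            /* ccs0 queue: EXEC first, then VM_BIND, matching the example above.
             * VM_BIND cannot start before the EXEC has finished, and fence4
             * only signals after the bind completes. */
            struct op ccs0[] = {
                    { "EXEC (engine=ccs0, in=fence2)",    &fence2, NULL    },
                    { "VM_BIND (engine=ccs0, in=fence3)", &fence3, &fence4 },
            };

            run_queue(ccs0, 2);
            printf("fence4 signaled: %d\n", fence4.signaled);
            return 0;
    }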
fence1 is not signaled
fence3 is signaled
So the second VM_BIND will proceed before the first VM_BIND.
I guess we can deal with that scenario in userspace by doing the wait ourselves in one thread per engines.
But then it makes the VM_BIND input fences useless.
Daniel : what do you think? Should we rework this or just deal with wait fences in userspace?
Sorry I noticed this late.
-Lionel
On Thu, Jun 02, 2022 at 08:42:13AM +0300, Lionel Landwerlin wrote:
On 02/06/2022 00:18, Matthew Brost wrote:
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
And the fact that it's happening in an async worker seem to imply it's not immediate.
I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do :
VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)
In what order are the fences going to be signaled?
In the order of VM_BIND ioctls? Or out of order?
Because you wrote "serialized" I assume it's : in order.
One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification.
In Vulkan VM_BIND operations are serialized but per engine.
So you could have something like this :
VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
Question - let's say this is done after the above operations:
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
Is the exec ordered with respect to bind (i.e. would fence3 & 4 be signaled before the exec starts)?
Matt
Hi Matt,
From the vulkan point of view, everything is serialized within an engine (we map that to a VkQueue).
So with :
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL) VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC completes first then VM_BIND executes.
To be even clearer :
EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL) VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC will wait until fence2 is signaled. Once fence2 is signaled, EXEC proceeds, finishes and only after it is done, VM_BIND executes.
It would be kind of like having the VM_BIND operation be another batch executed from the ring buffer.
Yea this makes sense. I think of VM_BINDs as more or less just another version of an EXEC and this fits with that.
In practice I don't think we can share a ring but we should be able to present an engine (again likely a gem context in i915) to the user that orders VM_BINDs / EXECs if that is what Vulkan expects, at least I think.
Hopefully Niranjana + Daniel agree.
Matt
On Thu, Jun 02, 2022 at 09:22:46AM -0700, Matthew Brost wrote:
On Thu, Jun 02, 2022 at 08:42:13AM +0300, Lionel Landwerlin wrote:
On 02/06/2022 00:18, Matthew Brost wrote:
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
And the fact that it's happening in an async worker seems to imply it's not immediate.
I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do:
VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)
In what order are the fences going to be signaled?
In the order of VM_BIND ioctls? Or out of order?
Because you wrote "serialized", I assume it's: in order.
One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification.
In Vulkan VM_BIND operations are serialized but per engine.
So you could have something like this:
VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
Question - let's say this is done after the above operations:
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
Is the exec ordered with respect to bind (i.e. would fence3 & fence4 be signaled before the exec starts)?
Matt
Hi Matt,
From the Vulkan point of view, everything is serialized within an engine (we map that to a VkQueue).
So with:
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC completes first, then VM_BIND executes.
To be even clearer:
EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC will wait until fence2 is signaled. Once fence2 is signaled, EXEC proceeds, finishes, and only after it is done, VM_BIND executes.
It would be kind of like having the VM_BIND operation be another batch executed from the ring buffer.
Yea this makes sense. I think of VM_BINDs as more or less just another version of an EXEC and this fits with that.
Note that VM_BIND itself can bind while an EXEC (GPU job) is running (say, getting binds ready for the next submission). It is up to the user, though, how to use it.
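As a rough illustration of that pipelining (again, the vm_bind()/exec() wrappers and their fence arguments are only a sketch of the intended usage, not actual uapi):

/* Frame N is already submitted and executing on the GPU. */
exec(engine, batch_frame_n, /* in_fence */ NULL, /* out */ &frame_n_done);

/* While it runs, bind what the next frame will need. */
vm_bind(vm, bo, va, size, /* in_fence */ NULL, /* out */ &bind_done);

/* The next submission is then ordered only after its own bind. */
exec(engine, batch_frame_n1, /* in_fence */ &bind_done, /* out */ &frame_n1_done);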
In practice I don't think we can share a ring, but we should be able to present an engine (again, likely a gem context in i915) to the user that orders VM_BINDs / EXECs, if that is what Vulkan expects, at least I think.
I have responded in the other thread on this.
Niranjana
Hopefully Niranjana + Daniel agree.
Matt
On Thu, Jun 2, 2022 at 7:42 AM Lionel Landwerlin lionel.g.landwerlin@intel.com wrote:
On 02/06/2022 00:18, Matthew Brost wrote:
On Wed, Jun 01, 2022 at 05:25:49PM +0300, Lionel Landwerlin wrote:
On 17/05/2022 21:32, Niranjana Vishwanathapura wrote:
+VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
I guess we should avoid saying "will immediately start binding/unbinding" if there are fences involved.
And the fact that it's happening in an async worker seems to imply it's not immediate.
I have a question on the behavior of the bind operation when no input fence is provided. Let's say I do:
VM_BIND (out_fence=fence1)
VM_BIND (out_fence=fence2)
VM_BIND (out_fence=fence3)
In what order are the fences going to be signaled?
In the order of VM_BIND ioctls? Or out of order?
Because you wrote "serialized", I assume it's: in order.
One thing I didn't realize is that because we only get one "VM_BIND" engine, there is a disconnect from the Vulkan specification.
Note that in Vulkan not every queue has to support sparse binding, so one could consider a dedicated sparse-binding-only queue family.
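For reference, that just means a queue family whose queueFlags advertise VK_QUEUE_SPARSE_BINDING_BIT but none of the graphics/compute bits; a purely illustrative way to look for one (phys_dev is assumed to be an already selected VkPhysicalDevice):

#include <vulkan/vulkan.h>

uint32_t count = 0;
vkGetPhysicalDeviceQueueFamilyProperties(phys_dev, &count, NULL);

VkQueueFamilyProperties props[count];
vkGetPhysicalDeviceQueueFamilyProperties(phys_dev, &count, props);

for (uint32_t i = 0; i < count; i++) {
    VkQueueFlags flags = props[i].queueFlags;
    if ((flags & VK_QUEUE_SPARSE_BINDING_BIT) &&
        !(flags & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT))) {
        /* queue family i is a dedicated sparse binding family */
    }
}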
In Vulkan VM_BIND operations are serialized but per engine.
So you could have something like this:
VM_BIND (engine=rcs0, in_fence=fence1, out_fence=fence2)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
Question - let's say this is done after the above operations:
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
Is the exec ordered with respect to bind (i.e. would fence3 & fence4 be signaled before the exec starts)?
Matt
Hi Matt,
From the Vulkan point of view, everything is serialized within an engine (we map that to a VkQueue).
So with:
EXEC (engine=ccs0, in_fence=NULL, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC completes first, then VM_BIND executes.
To be even clearer:
EXEC (engine=ccs0, in_fence=fence2, out_fence=NULL)
VM_BIND (engine=ccs0, in_fence=fence3, out_fence=fence4)
EXEC will wait until fence2 is signaled. Once fence2 is signaled, EXEC proceeds, finishes, and only after it is done, VM_BIND executes.
It would be kind of like having the VM_BIND operation be another batch executed from the ring buffer.
-Lionel
fence1 is not signaled
fence3 is signaled
So the second VM_BIND will proceed before the first VM_BIND.
I guess we can deal with that scenario in userspace by doing the wait ourselves in one thread per engine.
But then it makes the VM_BIND input fences useless.
I posed the same question on my series for AMD (https://patchwork.freedesktop.org/series/104578/), albeit for slightly different reasons: if one creates a new VkMemory object, you generally want that mapped ASAP, as you can't track (in a VK_KHR_descriptor_indexing world) whether the next submit is going to use this VkMemory object, and hence have to assume the worst (i.e. wait till the map/bind is complete before executing the next submission). If all binds/unbinds (or maps/unmaps) happen in order, that means an operation with input fences could delay work we want done ASAP.
Of course waiting in userspace does have disadvantages:
1) More overhead between fence signalling and the operation, potentially causing slightly bigger GPU bubbles.
2) You can't get an out fence early. Within the driver we can mostly work around this, but sync_fd exports, WSI and such will be messy.
3) Moving the queue to a thread might make things slightly less ideal due to scheduling delays.
Removing the in-order handling in the kernel generally seems like madness to me, as it is very hard to keep track of the state of the virtual address space (e.g. to track unmapping stuff before freeing memory or moving memory around).
The one game I tried (FH5 over vkd3d-proton) does sparse mapping as follows, on a separate queue:
1) a 0-cmdbuffer submit with 0 input semaphores and 1 output semaphore
2) a sparse bind with the input semaphore from 1) and 1 output semaphore
3) a 0-cmdbuffer submit with the input semaphore from 2) and 1 output fence
4) wait on that fence on the CPU
which works very well if we just wait for the sparse bind input semaphore in userspace, but I'm still working on seeing if this is the common use case or an outlier.
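For reference, that sequence maps roughly onto the following Vulkan calls (heavily abbreviated sketch; the semaphores, the fence, the queue and the sparse buffer bind are assumed to have been created/filled elsewhere):

#include <stdint.h>
#include <vulkan/vulkan.h>

/* 1) 0-cmdbuffer submit that only signals sem1 */
VkSubmitInfo submit1 = {
    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .signalSemaphoreCount = 1,
    .pSignalSemaphores = &sem1,
};
vkQueueSubmit(sparse_queue, 1, &submit1, VK_NULL_HANDLE);

/* 2) sparse bind waiting on sem1, signalling sem2 */
VkBindSparseInfo bind = {
    .sType = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores = &sem1,
    .bufferBindCount = 1,
    .pBufferBinds = &buffer_bind,
    .signalSemaphoreCount = 1,
    .pSignalSemaphores = &sem2,
};
vkQueueBindSparse(sparse_queue, 1, &bind, VK_NULL_HANDLE);

/* 3) 0-cmdbuffer submit waiting on sem2, signalling a fence */
VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;
VkSubmitInfo submit2 = {
    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores = &sem2,
    .pWaitDstStageMask = &wait_stage,
};
vkQueueSubmit(sparse_queue, 1, &submit2, fence);

/* 4) wait on that fence on the CPU */
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);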
Daniel: what do you think? Should we rework this or just deal with wait fences in userspace?
Sorry I noticed this late.
-Lionel
Regards, Oak
-----Original Message----- From: dri-devel dri-devel-bounces@lists.freedesktop.org On Behalf Of Niranjana Vishwanathapura Sent: May 17, 2022 2:32 PM To: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter, Daniel daniel.vetter@intel.com Cc: Brost, Matthew matthew.brost@intel.com; Hellstrom, Thomas thomas.hellstrom@intel.com; jason@jlekstrand.net; Wilson, Chris P chris.p.wilson@intel.com; christian.koenig@amd.com Subject: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst new file mode 100644 index 000000000000..f1be560d313c --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.rst @@ -0,0 +1,304 @@ +========================================== +I915 VM_BIND feature design and use cases +========================================== + +VM_BIND feature +================ +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer +objects (BOs) or sections of a BOs at specified GPU virtual addresses on a +specified address space (VM). These mappings (also referred to as persistent +mappings) will be persistent across multiple GPU submissions (execbuff calls) +issued by the UMD, without user having to provide a list of all required +mappings during each submission (as required by older execbuff mode). + +VM_BIND/UNBIND ioctls will support 'in' and 'out' fences to allow userpace +to specify how the binding/unbinding should sync with other operations +like the GPU job submission. These fences will be timeline 'drm_syncobj's +for non-Compute contexts (See struct drm_i915_vm_bind_ext_timeline_fences). +For Compute contexts, they will be user/memory fences (See struct +drm_i915_vm_bind_ext_user_fence). + +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND. +User has to opt-in for VM_BIND mode of binding for an address space (VM) +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension. + +VM_BIND/UNBIND ioctl will immediately start binding/unbinding the mapping in an +async worker. The binding and unbinding will work like a special GPU engine. +The binding and unbinding operations are serialized and will wait on specified +input fences before the operation and will signal the output fences upon the +completion of the operation. Due to serialization, completion of an operation +will also indicate that all previous operations are also complete.
Hi,
Is the user required to wait for the out fence to be signaled before submitting a gpu job using the vm_bind address? Or is the user required to order the gpu job so that it runs after the vm_bind out fence is signaled?
I think there could be different behavior on a non-faultable platform and a faultable platform. For example, on a non-faultable platform the gpu job is required to be ordered after the vm_bind out fence signaling, while on a faultable platform there is no such restriction, since the vm_bind can be finished in the fault handler?
Should we document such a thing?
Regards, Oak
On Wed, Jun 01, 2022 at 07:13:16PM -0700, Zeng, Oak wrote:
Regards, Oak
Hi,
Is the user required to wait for the out fence to be signaled before submitting a gpu job using the vm_bind address? Or is the user required to order the gpu job so that it runs after the vm_bind out fence is signaled?
Thanks Oak. Either should be fine, and it is up to the user how to use the vm_bind/unbind out-fence.
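For illustration, with a drm_syncobj out-fence the two options would look roughly like this (the vm_bind()/exec() wrappers are stand-ins for the RFC uapi; only drmSyncobjWait() is existing libdrm API):

#include <stdint.h>
#include <xf86drm.h>

/* Option 1: wait on the CPU for the bind to complete, then submit. */
vm_bind(fd, vm_id, bo, va, size, /* out syncobj */ bind_done);
drmSyncobjWait(fd, &bind_done, 1, INT64_MAX, 0, NULL);
exec(fd, engine, batch_va, /* in syncobj */ 0);

/* Option 2: no CPU wait, order the job after the bind instead. */
vm_bind(fd, vm_id, bo, va, size, /* out syncobj */ bind_done);
exec(fd, engine, batch_va, /* in syncobj */ bind_done);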
I think there could be different behavior on a non-faultable platform and a faultable platform. For example, on a non-faultable platform the gpu job is required to be ordered after the vm_bind out fence signaling, while on a faultable platform there is no such restriction, since the vm_bind can be finished in the fault handler?
With a GPU page fault handler, the out fence won't be needed, as residency is purely managed by the page fault handler populating the page tables (there is a mention of it in the GPU Page Faults section below).
Should we document such a thing?
We don't talk much about the GPU page faults case in this document, as that may warrant a separate RFC when we add page fault support. We did mention it in a couple of places to ensure our locking design here is extensible to the gpu page faults case.
Niranjana
Regards, Oak
-----Original Message----- From: Vishwanathapura, Niranjana niranjana.vishwanathapura@intel.com Sent: June 2, 2022 4:49 PM To: Zeng, Oak oak.zeng@intel.com Cc: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Vetter, Daniel daniel.vetter@intel.com; Brost, Matthew matthew.brost@intel.com; Hellstrom, Thomas thomas.hellstrom@intel.com; jason@jlekstrand.net; Wilson, Chris P chris.p.wilson@intel.com; christian.koenig@amd.com Subject: Re: [RFC v3 1/3] drm/doc/rfc: VM_BIND feature design document
Ok, that makes sense to me. Thanks for explaining.
Regards, Oak
Niranjana
Regards, Oak
+VM_BIND features include:
+* Multiple Virtual Address (VA) mappings can map to the same physical
pages
- of an object (aliasing).
+* VA mapping can map to a partial section of the BO (partial binding). +* Support capture of persistent mappings in the dump upon GPU error. +* TLB is flushed upon unbind completion. Batching of TLB flushes in some
- use cases will be helpful.
+* Asynchronous vm_bind and vm_unbind support with 'in' and 'out' fences. +* Support for userptr gem objects (no special uapi is required for this).
+Execbuff ioctl in VM_BIND mode +------------------------------- +The execbuff ioctl handling in VM_BIND mode differs significantly from the +older method. A VM in VM_BIND mode will not support older execbuff
mode of
+binding. In VM_BIND mode, execbuff ioctl will not accept any execlist.
Hence,
+no support for implicit sync. It is expected that the below work will be able +to support requirements of object dependency setting in all use cases:
+"dma-buf: Add an API for exporting sync files" +(https://lwn.net/Articles/859290/)
+This also means, we need an execbuff extension to pass in the batch +buffer addresses (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+If at all execlist support in execbuff ioctl is deemed necessary for +implicit sync in certain use cases, then support can be added later.
+In VM_BIND mode, VA allocation is completely managed by the user instead
of
+the i915 driver. Hence all VA assignment, eviction are not applicable in +VM_BIND mode. Also, for determining object activeness, VM_BIND mode
will
not +be using the i915_vma active reference tracking. It will instead use dma-resv +object for that (See `VM_BIND dma_resv usage`_).
+So, a lot of existing code in the execbuff path like relocations, VA evictions, +vma lookup table, implicit sync, vma active reference tracking etc., are not +applicable in VM_BIND mode. Hence, the execbuff path needs to be cleaned
up
+by clearly separating out the functionalities where the VM_BIND mode
differs
+from older method and they should be moved to separate files.
+VM_PRIVATE objects +------------------- +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO
which
+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped
on
+the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each
execbuff
+submission, they need only one dma-resv fence list updated. Thus, the fast +path (where required mappings are already bound) submission latency is
O(1)
+w.r.t the number of VM private BOs.
+VM_BIND locking hirarchy +------------------------- +The locking design here supports the older (execlist based) execbuff mode,
the
+newer VM_BIND mode, the VM_BIND mode with GPU page faults and
possible
future +system allocator support (See `Shared Virtual Memory (SVM) support`_). +The older execbuff mode and the newer VM_BIND mode without page
faults
manages +residency of backing storage using dma_fence. The VM_BIND mode with
page
faults +and the system allocator support do not use any dma_fence at all.
+VM_BIND locking order is as below.
+1) Lock-A: A vm_bind mutex will protect vm_bind lists. This lock is taken in
- vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing
the
- mapping.
- In future, when GPU page faults are supported, we can potentially use a
- rwsem instead, so that multiple page fault handlers can take the read side
- lock to lookup the mapping and hence can run in parallel.
- The older execbuff mode of binding do not need this lock.
+2) Lock-B: The object's dma-resv lock will protect i915_vma state and needs
to
- be held while binding/unbinding a vma in the async worker and while
updating
- dma-resv fence list of an object. Note that private BOs of a VM will all
- share a dma-resv object.
- The future system allocator support will use the HMM prescribed locking
- instead.
+3) Lock-C: Spinlock/s to protect some of the VM's lists like the list of
- invalidated vmas (due to eviction and userptr invalidation) etc.
+When GPU page faults are supported, the execbuff path do not take any of these +locks. There we will simply smash the new batch buffer address into the ring and +then tell the scheduler run that. The lock taking only happens from the page +fault handler, where we take lock-A in read mode, whichever lock-B we
need to
+find the backing storage (dma_resv lock for gem objects, and hmm/core mm
for
+system allocator) and some additional locks (lock-D) for taking care of page +table races. Page fault mode should not need to ever manipulate the vm
lists,
+so won't ever need lock-C.
+VM_BIND LRU handling +--------------------- +We need to ensure VM_BIND mapped objects are properly LRU tagged to
avoid
+performance degradation. We will also need support for bulk LRU movement
of
+VM_BIND objects to avoid additional latencies in execbuff path.
+The page table pages are similar to VM_BIND mapped objects (See +`Evictable page table allocations`_) and are maintained per VM and needs to +be pinned in memory when VM is made active (ie., upon an execbuff call
with
+that VM). So, bulk LRU movement of page table pages is also needed.
+The i915 shrinker LRU has stopped being an LRU. So, it should also be moved +over to the ttm LRU in some fashion to make sure we once again have a reasonable +and consistent memory aging and reclaim architecture.
+VM_BIND dma_resv usage +----------------------- +Fences needs to be added to all VM_BIND mapped objects. During each execbuff +submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage to prevent +over sync (See enum dma_resv_usage). One can override it with either +DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during
object
dependency +setting (either through explicit or implicit mechanism).
+When vm_bind is called for a non-private object while the VM is already +active, the fences need to be copied from VM's shared dma-resv object +(common to all private objects of the VM) to this non-private object. +If this results in performance degradation, then some optimization will +be needed here. This is not a problem for VM's private objects as they use +shared dma-resv object which is always updated on each execbuff
submission.
+Also, in VM_BIND mode, use dma-resv apis for determining object
activeness
+(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and do not
use
the +older i915_vma active reference tracking which is deprecated. This should be +easier to get it working with the current TTM backend. We can remove the +i915_vma active reference tracking fully while supporting TTM backend for
igfx.
Evictable page table allocations
--------------------------------
Make page table allocations evictable and manage them similarly to VM_BIND
mapped objects. Page table pages are similar to persistent mappings of a VM
(the difference here is that the page table pages will not have an i915_vma
structure and, after swapping pages back in, the parent page link needs to be
updated).

Mesa use case
-------------
VM_BIND can potentially reduce the CPU overhead in Mesa (both Vulkan and
Iris), hence improving performance of CPU-bound applications. It also allows
us to implement Vulkan's Sparse Resources. With increasing GPU hardware
performance, reducing CPU overhead becomes more impactful.


VM_BIND Compute support
=======================
User/Memory Fence
-----------------
The idea is to take a user specified virtual address and install an interrupt
handler to wake up the current task when the memory location passes the user
supplied filter. A User/Memory fence is an <address, value> pair. To signal
the user fence, the specified value will be written at the specified virtual
address and the waiting process will be woken up. The user can wait on a user
fence with the gem_wait_user_fence ioctl. A sketch of these semantics is
shown after this section.

It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
interrupt within their batches after updating the value to have sub-batch
precision on the wakeup. Each batch can signal a user fence to indicate the
completion of the next level batch. The completion of the very first level
batch needs to be signaled by the command streamer. The user must provide the
user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
extension of the execbuff ioctl, so that the KMD can set up the command
streamer to signal it.

A User/Memory fence can also be supplied to the kernel driver to signal/wake
up the user process after completion of an asynchronous operation.

When the VM_BIND ioctl is provided with a user/memory fence via the
I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the
completion of binding of that mapping. All async binds/unbinds are
serialized, hence signaling of a user/memory fence also indicates the
completion of all previous binds/unbinds.

This feature will be derived from the below original work:
https://patchwork.freedesktop.org/patch/349417/
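To make the <address, value> semantics above concrete, here is a small
illustrative userspace sketch. It is purely hypothetical (a real UMD would
normally use the proposed gem_wait_user_fence ioctl rather than busy-waiting
or open coding this), and only shows what signaling a user fence means. The
struct and function names are made up for this example.

#include <stdint.h>

/*
 * Illustrative only: a user fence is a qword in user accessible memory plus
 * an expected value. The signaler (a batch, the command streamer or the KMD
 * async worker) writes 'val' to 'addr'; the waiter observes it.
 */
struct user_fence {
        uint64_t *addr;    /* qword aligned virtual address */
        uint64_t val;      /* value that signals the fence */
};

/* Matches the I915_UFENCE_WAIT_EQ condition (proposed later in this series)
 * with a full mask. */
static inline int user_fence_signaled(const struct user_fence *uf)
{
        return __atomic_load_n(uf->addr, __ATOMIC_ACQUIRE) == uf->val;
}

/* What the signaler (GPU or KMD) conceptually does. */
static inline void user_fence_signal(const struct user_fence *uf)
{
        __atomic_store_n(uf->addr, uf->val, __ATOMIC_RELEASE);
}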
Long running Compute contexts
-----------------------------
Usage of dma-fence expects that it completes in a reasonable amount of time.
Compute on the other hand can be long running. Hence it is appropriate for
compute to use user/memory fences, and dma-fence usage will be limited to
in-kernel consumption only. This requires an execbuff uapi extension to pass
in a user fence (See struct drm_i915_vm_bind_ext_user_fence). Compute must
opt-in for this mechanism with the I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
flag during context creation (a sketch is shown after this section). The
dma-fence based user interfaces like the gem_wait ioctl and the execbuff out
fence are not allowed on long running contexts. Implicit sync is not valid
either and is anyway not supported in VM_BIND mode.
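As an illustration, a userspace sketch of the opt-in described above. The
flag value is taken from the RFC header later in this series (not yet in
include/uapi) and error handling is elided.

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Proposed in this RFC series (Documentation/gpu/rfc/i915_vm_bind.h). */
#ifndef I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
#endif

/* Illustrative only: create a context that opts in to long running mode. */
static int create_long_running_context(int drm_fd, uint32_t *ctx_id)
{
        struct drm_i915_gem_context_create_ext create = {
                .flags = I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING,
        };
        int ret;

        ret = ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
        if (ret == 0)
                *ctx_id = create.ctx_id;

        return ret;
}

Such a context would then rely on user/memory fences for completion
signaling; requesting an execbuff out fence on it would be rejected.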
Where GPU page faults are not available, the kernel driver, upon buffer
invalidation, will initiate a suspend (preemption) of the long running
context with a dma-fence attached to it. Upon completion of that suspend
fence, it will finish the invalidation, revalidate the BO and then resume the
compute context. This is done by having a per-context preempt fence (also
called a suspend fence) proxying as the i915_request fence. This suspend
fence is enabled when someone tries to wait on it, which then triggers the
context preemption.

As this support for context suspension using a preempt fence and the resume
work for the compute mode contexts can get tricky to get right, it is better
to add this support in the drm scheduler so that multiple drivers can make
use of it. That means it will have a dependency on the i915 drm scheduler
conversion with the GuC scheduler backend. This should be fine, as the plan
is to support compute mode contexts only with the GuC scheduler backend (at
least initially). This is much easier to support with VM_BIND mode compared
to the current heavier execbuff path resource attachment.
Low Latency Submission
----------------------
Allows the compute UMD to directly submit GPU jobs instead of going through
the execbuff ioctl. This is made possible by VM_BIND not being synchronized
against execbuff. VM_BIND allows bind/unbind of the mappings required for the
directly submitted jobs.


Other VM_BIND use cases
=======================

Debugger
--------
With the debug event interface, a user space process (the debugger) is able
to keep track of and act upon resources created by another process (the
debugged process) and attached to the GPU via the vm_bind interface.

GPU page faults
---------------
GPU page faults, when supported (in the future), will only be supported in
VM_BIND mode. While both the older execbuff mode and the newer VM_BIND mode
of binding will require using dma-fence to ensure residency, the GPU page
faults mode, when supported, will not use any dma-fence as residency is
purely managed by installing and removing/invalidating page table entries.
Page level hints settings
-------------------------
VM_BIND allows any hints setting per mapping instead of per BO. Possible
hints include read-only mapping, placement and atomicity. Sub-BO level
placement hints will be even more relevant with upcoming GPU on-demand page
fault support.

Page level Cache/CLOS settings
------------------------------
VM_BIND allows cache/CLOS settings per mapping instead of per BO.

Shared Virtual Memory (SVM) support
-----------------------------------
The VM_BIND interface can be used to map system memory directly (without the
gem BO abstraction) using the HMM interface. SVM is only supported with GPU
page faults enabled.
Broader i915 cleanups
=====================
Supporting this whole new vm_bind mode of binding, which comes with its own
use cases to support and its locking requirements, requires proper
integration with the existing i915 driver. This calls for some broader i915
driver cleanups/simplifications for maintainability of the driver going
forward. Here are a few things that have been identified and are being looked
into.

- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
  feature does not use it and the complexity it brings in is probably more
  than the performance advantage we get in the legacy execbuff case.
- Remove vma->open_count counting.
- Remove i915_vma active reference tracking. The VM_BIND feature will not be
  using it. Instead, use the underlying BO's dma-resv fence list to determine
  whether an i915_vma is active or not.
VM_BIND UAPI
============
.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h

diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..7d10c36b268d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::

     i915_scheduler.rst
+
+.. toctree::
+
+    i915_vm_bind.rst
--
2.21.0.rc0.32.g243a4c7e27
Add some missing i915 uapi documentation which the new i915 VM_BIND feature documentation will refer to.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com --- include/uapi/drm/i915_drm.h | 153 +++++++++++++++++++++++++++--------- 1 file changed, 116 insertions(+), 37 deletions(-)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h index a2def7b27009..8c834a31b56f 100644 --- a/include/uapi/drm/i915_drm.h +++ b/include/uapi/drm/i915_drm.h @@ -751,9 +751,16 @@ typedef struct drm_i915_irq_wait {
/* Must be kept compact -- no holes and well documented */
+/** + * typedef drm_i915_getparam_t - Driver parameter query structure. + */ typedef struct drm_i915_getparam { + /** @param: Driver parameter to query. */ __s32 param; - /* + + /** + * @value: Address of memory where queried value should be put. + * * WARNING: Using pointers instead of fixed-size u64 means we need to write * compat32 code. Don't repeat this mistake. */ @@ -1239,76 +1246,114 @@ struct drm_i915_gem_exec_object2 { __u64 rsvd2; };
+/** + * struct drm_i915_gem_exec_fence - An input or output fence for the execbuff + * ioctl. + * + * The request will wait for input fence to signal before submission. + * + * The returned output fence will be signaled after the completion of the + * request. + */ struct drm_i915_gem_exec_fence { - /** - * User's handle for a drm_syncobj to wait on or signal. - */ + /** @handle: User's handle for a drm_syncobj to wait on or signal. */ __u32 handle;
+ /** + * @flags: Supported flags are, + * + * I915_EXEC_FENCE_WAIT: + * Wait for the input fence before request submission. + * + * I915_EXEC_FENCE_SIGNAL: + * Return request completion fence as output + */ + __u32 flags; #define I915_EXEC_FENCE_WAIT (1<<0) #define I915_EXEC_FENCE_SIGNAL (1<<1) #define __I915_EXEC_FENCE_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_SIGNAL << 1)) - __u32 flags; };
-/* - * See drm_i915_gem_execbuffer_ext_timeline_fences. - */ -#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0 - -/* +/** + * struct drm_i915_gem_execbuffer_ext_timeline_fences - Timeline fences + * for execbuff. + * * This structure describes an array of drm_syncobj and associated points for * timeline variants of drm_syncobj. It is invalid to append this structure to * the execbuf if I915_EXEC_FENCE_ARRAY is set. */ struct drm_i915_gem_execbuffer_ext_timeline_fences { +#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0 + /** @base: Extension link. See struct i915_user_extension. */ struct i915_user_extension base;
/** - * Number of element in the handles_ptr & value_ptr arrays. + * @fence_count: Number of element in the @handles_ptr & @value_ptr + * arrays. */ __u64 fence_count;
/** - * Pointer to an array of struct drm_i915_gem_exec_fence of length - * fence_count. + * @handles_ptr: Pointer to an array of struct drm_i915_gem_exec_fence + * of length @fence_count. */ __u64 handles_ptr;
/** - * Pointer to an array of u64 values of length fence_count. Values - * must be 0 for a binary drm_syncobj. A Value of 0 for a timeline - * drm_syncobj is invalid as it turns a drm_syncobj into a binary one. + * @values_ptr: Pointer to an array of u64 values of length + * @fence_count. + * Values must be 0 for a binary drm_syncobj. A Value of 0 for a + * timeline drm_syncobj is invalid as it turns a drm_syncobj into a + * binary one. */ __u64 values_ptr; };
+/** + * struct drm_i915_gem_execbuffer2 - Structure for execbuff submission + */ struct drm_i915_gem_execbuffer2 { - /** - * List of gem_exec_object2 structs - */ + /** @buffers_ptr: Pointer to a list of gem_exec_object2 structs */ __u64 buffers_ptr; + + /** @buffer_count: Number of elements in @buffers_ptr array */ __u32 buffer_count;
- /** Offset in the batchbuffer to start execution from. */ + /** + * @batch_start_offset: Offset in the batchbuffer to start execution + * from. + */ __u32 batch_start_offset; - /** Bytes used in batchbuffer from batch_start_offset */ + + /** @batch_len: Bytes used in batchbuffer from batch_start_offset */ __u32 batch_len; + + /** @DR1: deprecated */ __u32 DR1; + + /** @DR4: deprecated */ __u32 DR4; + + /** @num_cliprects: See @cliprects_ptr */ __u32 num_cliprects; + /** - * This is a struct drm_clip_rect *cliprects if I915_EXEC_FENCE_ARRAY - * & I915_EXEC_USE_EXTENSIONS are not set. + * @cliprects_ptr: Kernel clipping was a DRI1 misfeature. + * + * It is invalid to use this field if I915_EXEC_FENCE_ARRAY or + * I915_EXEC_USE_EXTENSIONS flags are not set. * * If I915_EXEC_FENCE_ARRAY is set, then this is a pointer to an array - * of struct drm_i915_gem_exec_fence and num_cliprects is the length - * of the array. + * of &drm_i915_gem_exec_fence and @num_cliprects is the length of the + * array. * * If I915_EXEC_USE_EXTENSIONS is set, then this is a pointer to a - * single struct i915_user_extension and num_cliprects is 0. + * single &i915_user_extension and num_cliprects is 0. */ __u64 cliprects_ptr; + + /** @flags: Execbuff flags */ + __u64 flags; #define I915_EXEC_RING_MASK (0x3f) #define I915_EXEC_DEFAULT (0<<0) #define I915_EXEC_RENDER (1<<0) @@ -1326,10 +1371,6 @@ struct drm_i915_gem_execbuffer2 { #define I915_EXEC_CONSTANTS_REL_GENERAL (0<<6) /* default */ #define I915_EXEC_CONSTANTS_ABSOLUTE (1<<6) #define I915_EXEC_CONSTANTS_REL_SURFACE (2<<6) /* gen4/5 only */ - __u64 flags; - __u64 rsvd1; /* now used for context info */ - __u64 rsvd2; -};
/** Resets the SO write offset registers for transform feedback on gen7. */ #define I915_EXEC_GEN7_SOL_RESET (1<<8) @@ -1432,9 +1473,23 @@ struct drm_i915_gem_execbuffer2 { * drm_i915_gem_execbuffer_ext enum. */ #define I915_EXEC_USE_EXTENSIONS (1 << 21) - #define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_USE_EXTENSIONS << 1))
+ /** @rsvd1: Context id */ + __u64 rsvd1; + + /** + * @rsvd2: in and out sync_file file descriptors. + * + * When I915_EXEC_FENCE_IN or I915_EXEC_FENCE_SUBMIT flag is set, the + * lower 32 bits of this field will have the in sync_file fd (input). + * + * When I915_EXEC_FENCE_OUT flag is set, the upper 32 bits of this + * field will have the out sync_file fd (output). + */ + __u64 rsvd2; +}; + #define I915_EXEC_CONTEXT_ID_MASK (0xffffffff) #define i915_execbuffer2_set_context_id(eb2, context) \ (eb2).rsvd1 = context & I915_EXEC_CONTEXT_ID_MASK @@ -1814,13 +1869,32 @@ struct drm_i915_gem_context_create { __u32 pad; };
+/** + * struct drm_i915_gem_context_create_ext - Structure for creating contexts. + */ struct drm_i915_gem_context_create_ext { - __u32 ctx_id; /* output: id of new context*/ + /** @ctx_id: Id of the created context (output) */ + __u32 ctx_id; + + /** + * @flags: Supported flags are, + * + * I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS: + * + * Extensions may be appended to this structure and driver must check + * for those. + * + * I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE + * + * Created context will have single timeline. + */ __u32 flags; #define I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS (1u << 0) #define I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE (1u << 1) #define I915_CONTEXT_CREATE_FLAGS_UNKNOWN \ (-(I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE << 1)) + + /** @extensions: Zero-terminated chain of extensions. */ __u64 extensions; };
@@ -2387,7 +2461,9 @@ struct drm_i915_gem_context_destroy { __u32 pad; };
-/* +/** + * struct drm_i915_gem_vm_control - Structure to create or destroy VM. + * * DRM_I915_GEM_VM_CREATE - * * Create a new virtual memory address space (ppGTT) for use within a context @@ -2397,20 +2473,23 @@ struct drm_i915_gem_context_destroy { * The id of new VM (bound to the fd) for use with I915_CONTEXT_PARAM_VM is * returned in the outparam @id. * - * No flags are defined, with all bits reserved and must be zero. - * * An extension chain maybe provided, starting with @extensions, and terminated * by the @next_extension being 0. Currently, no extensions are defined. * * DRM_I915_GEM_VM_DESTROY - * - * Destroys a previously created VM id, specified in @id. + * Destroys a previously created VM id, specified in @vm_id. * * No extensions or flags are allowed currently, and so must be zero. */ struct drm_i915_gem_vm_control { + /** @extensions: Zero-terminated chain of extensions. */ __u64 extensions; + + /** @flags: reserved for future usage, currently MBZ */ __u32 flags; + + /** @vm_id: Id of the VM created or to be destroyed */ __u32 vm_id; };
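As a concrete illustration of the @rsvd2 documentation added above (not part
of the patch itself), here is a userspace sketch of how the in/out sync_file
fds are packed when the existing I915_EXEC_FENCE_IN/OUT flags are used. The
function name is made up for this example and error handling is minimal.

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/*
 * Illustrative only: submit with an input sync_file fd and get an output
 * fence fd back. The input fd goes into the lower 32 bits of rsvd2 and the
 * output fd comes back in the upper 32 bits.
 */
static int execbuf_with_fence_fds(int drm_fd,
                                  struct drm_i915_gem_execbuffer2 *eb,
                                  int in_fence_fd, int *out_fence_fd)
{
        int ret;

        eb->flags |= I915_EXEC_FENCE_IN | I915_EXEC_FENCE_OUT;
        eb->rsvd2 = (uint32_t)in_fence_fd;

        /* The _WR variant is needed so the kernel can write rsvd2 back. */
        ret = ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2_WR, eb);
        if (ret == 0)
                *out_fence_fd = (int)(eb->rsvd2 >> 32);

        return ret;
}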
On Tue, 17 May 2022 at 19:32, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
Add some missing i915 upai documentation which the new i915 VM_BIND feature documentation will be refer to.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
include/uapi/drm/i915_drm.h | 153 +++++++++++++++++++++++++++--------- 1 file changed, 116 insertions(+), 37 deletions(-)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h index a2def7b27009..8c834a31b56f 100644 --- a/include/uapi/drm/i915_drm.h +++ b/include/uapi/drm/i915_drm.h @@ -751,9 +751,16 @@ typedef struct drm_i915_irq_wait {
/* Must be kept compact -- no holes and well documented */
+/**
- typedef drm_i915_getparam_t - Driver parameter query structure.
This one looks funny in the rendered html for some reason, since it doesn't seem to emit the @param and @value, I guess it doesn't really understand typedef <struct> ?
Maybe make this "struct drm_i915_getparam - Driver parameter query structure." ?
- */
typedef struct drm_i915_getparam {
/** @param: Driver parameter to query. */ __s32 param;
/*
/**
* @value: Address of memory where queried value should be put.
* * WARNING: Using pointers instead of fixed-size u64 means we need to write * compat32 code. Don't repeat this mistake. */
@@ -1239,76 +1246,114 @@ struct drm_i915_gem_exec_object2 { __u64 rsvd2; };
+/**
- struct drm_i915_gem_exec_fence - An input or output fence for the execbuff
s/execbuff/execbuf/, at least that seems to be what we use elsewhere, AFAICT.
- ioctl.
- The request will wait for input fence to signal before submission.
- The returned output fence will be signaled after the completion of the
- request.
- */
struct drm_i915_gem_exec_fence {
/**
* User's handle for a drm_syncobj to wait on or signal.
*/
/** @handle: User's handle for a drm_syncobj to wait on or signal. */ __u32 handle;
/**
* @flags: Supported flags are,
are:
*
* I915_EXEC_FENCE_WAIT:
* Wait for the input fence before request submission.
*
* I915_EXEC_FENCE_SIGNAL:
* Return request completion fence as output
*/
__u32 flags;
#define I915_EXEC_FENCE_WAIT (1<<0) #define I915_EXEC_FENCE_SIGNAL (1<<1) #define __I915_EXEC_FENCE_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_SIGNAL << 1))
__u32 flags;
};
-/*
- See drm_i915_gem_execbuffer_ext_timeline_fences.
- */
-#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
-/* +/**
- struct drm_i915_gem_execbuffer_ext_timeline_fences - Timeline fences
- for execbuff.
*/
- This structure describes an array of drm_syncobj and associated points for
- timeline variants of drm_syncobj. It is invalid to append this structure to
- the execbuf if I915_EXEC_FENCE_ARRAY is set.
struct drm_i915_gem_execbuffer_ext_timeline_fences { +#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
/** @base: Extension link. See struct i915_user_extension. */ struct i915_user_extension base; /**
* Number of element in the handles_ptr & value_ptr arrays.
* @fence_count: Number of element in the @handles_ptr & @value_ptr
s/element/elements/
* arrays. */ __u64 fence_count; /**
* Pointer to an array of struct drm_i915_gem_exec_fence of length
* fence_count.
* @handles_ptr: Pointer to an array of struct drm_i915_gem_exec_fence
* of length @fence_count. */ __u64 handles_ptr; /**
* Pointer to an array of u64 values of length fence_count. Values
* must be 0 for a binary drm_syncobj. A Value of 0 for a timeline
* drm_syncobj is invalid as it turns a drm_syncobj into a binary one.
* @values_ptr: Pointer to an array of u64 values of length
* @fence_count.
* Values must be 0 for a binary drm_syncobj. A Value of 0 for a
* timeline drm_syncobj is invalid as it turns a drm_syncobj into a
* binary one. */ __u64 values_ptr;
};
+/**
- struct drm_i915_gem_execbuffer2 - Structure for execbuff submission
- */
struct drm_i915_gem_execbuffer2 {
/**
* List of gem_exec_object2 structs
*/
/** @buffers_ptr: Pointer to a list of gem_exec_object2 structs */ __u64 buffers_ptr;
/** @buffer_count: Number of elements in @buffers_ptr array */ __u32 buffer_count;
/** Offset in the batchbuffer to start execution from. */
/**
* @batch_start_offset: Offset in the batchbuffer to start execution
* from.
*/ __u32 batch_start_offset;
/** Bytes used in batchbuffer from batch_start_offset */
/** @batch_len: Bytes used in batchbuffer from batch_start_offset */
"Length in bytes of the batchbuffer, otherwise assumed to be the object size if zero, starting from the @batch_start_offset."
__u32 batch_len;
/** @DR1: deprecated */ __u32 DR1;
/** @DR4: deprecated */ __u32 DR4;
/** @num_cliprects: See @cliprects_ptr */ __u32 num_cliprects;
/**
* This is a struct drm_clip_rect *cliprects if I915_EXEC_FENCE_ARRAY
* & I915_EXEC_USE_EXTENSIONS are not set.
* @cliprects_ptr: Kernel clipping was a DRI1 misfeature.
*
* It is invalid to use this field if I915_EXEC_FENCE_ARRAY or
* I915_EXEC_USE_EXTENSIONS flags are not set. * * If I915_EXEC_FENCE_ARRAY is set, then this is a pointer to an array
* of struct drm_i915_gem_exec_fence and num_cliprects is the length
* of the array.
* of &drm_i915_gem_exec_fence and @num_cliprects is the length of the
* array. * * If I915_EXEC_USE_EXTENSIONS is set, then this is a pointer to a
* single struct i915_user_extension and num_cliprects is 0.
* single &i915_user_extension and num_cliprects is 0. */ __u64 cliprects_ptr;
/** @flags: Execbuff flags */
s/Execbuff/Execbuf/
Could maybe document the I915_EXEC_* also, or maybe not ;)
__u64 flags;
#define I915_EXEC_RING_MASK (0x3f) #define I915_EXEC_DEFAULT (0<<0) #define I915_EXEC_RENDER (1<<0) @@ -1326,10 +1371,6 @@ struct drm_i915_gem_execbuffer2 { #define I915_EXEC_CONSTANTS_REL_GENERAL (0<<6) /* default */ #define I915_EXEC_CONSTANTS_ABSOLUTE (1<<6) #define I915_EXEC_CONSTANTS_REL_SURFACE (2<<6) /* gen4/5 only */
__u64 flags;
__u64 rsvd1; /* now used for context info */
__u64 rsvd2;
-};
/** Resets the SO write offset registers for transform feedback on gen7. */ #define I915_EXEC_GEN7_SOL_RESET (1<<8) @@ -1432,9 +1473,23 @@ struct drm_i915_gem_execbuffer2 {
- drm_i915_gem_execbuffer_ext enum.
*/ #define I915_EXEC_USE_EXTENSIONS (1 << 21)
#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_USE_EXTENSIONS << 1))
/** @rsvd1: Context id */
__u64 rsvd1;
/**
* @rsvd2: in and out sync_file file descriptors.
*
* When I915_EXEC_FENCE_IN or I915_EXEC_FENCE_SUBMIT flag is set, the
* lower 32 bits of this field will have the in sync_file fd (input).
*
* When I915_EXEC_FENCE_OUT flag is set, the upper 32 bits of this
* field will have the out sync_file fd (output).
*/
__u64 rsvd2;
+};
#define I915_EXEC_CONTEXT_ID_MASK (0xffffffff) #define i915_execbuffer2_set_context_id(eb2, context) \ (eb2).rsvd1 = context & I915_EXEC_CONTEXT_ID_MASK @@ -1814,13 +1869,32 @@ struct drm_i915_gem_context_create { __u32 pad; };
+/**
- struct drm_i915_gem_context_create_ext - Structure for creating contexts.
- */
struct drm_i915_gem_context_create_ext {
__u32 ctx_id; /* output: id of new context*/
/** @ctx_id: Id of the created context (output) */
__u32 ctx_id;
/**
* @flags: Supported flags are,
are:
*
* I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS:
*
* Extensions may be appended to this structure and driver must check
* for those.
Maybe add "See @extensions.", and then....
*
* I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE
*
* Created context will have single timeline.
*/ __u32 flags;
#define I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS (1u << 0) #define I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE (1u << 1) #define I915_CONTEXT_CREATE_FLAGS_UNKNOWN \ (-(I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE << 1))
/** @extensions: Zero-terminated chain of extensions. */
...here perhaps list the extensions, and maybe also move the #define for each here? See for example @extensions in drm_i915_gem_create_ext.
Reviewed-by: Matthew Auld matthew.auld@intel.com
__u64 extensions;
};
@@ -2387,7 +2461,9 @@ struct drm_i915_gem_context_destroy { __u32 pad; };
-/* +/**
- struct drm_i915_gem_vm_control - Structure to create or destroy VM.
- DRM_I915_GEM_VM_CREATE -
- Create a new virtual memory address space (ppGTT) for use within a context
@@ -2397,20 +2473,23 @@ struct drm_i915_gem_context_destroy {
- The id of new VM (bound to the fd) for use with I915_CONTEXT_PARAM_VM is
- returned in the outparam @id.
- No flags are defined, with all bits reserved and must be zero.
- An extension chain maybe provided, starting with @extensions, and terminated
- by the @next_extension being 0. Currently, no extensions are defined.
- DRM_I915_GEM_VM_DESTROY -
- Destroys a previously created VM id, specified in @id.
*/
- Destroys a previously created VM id, specified in @vm_id.
- No extensions or flags are allowed currently, and so must be zero.
struct drm_i915_gem_vm_control {
/** @extensions: Zero-terminated chain of extensions. */ __u64 extensions;
/** @flags: reserved for future usage, currently MBZ */ __u32 flags;
/** @vm_id: Id of the VM created or to be destroyed */ __u32 vm_id;
};
-- 2.21.0.rc0.32.g243a4c7e27
On Wed, Jun 08, 2022 at 12:24:04PM +0100, Matthew Auld wrote:
On Tue, 17 May 2022 at 19:32, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
Add some missing i915 upai documentation which the new i915 VM_BIND feature documentation will be refer to.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
include/uapi/drm/i915_drm.h | 153 +++++++++++++++++++++++++++--------- 1 file changed, 116 insertions(+), 37 deletions(-)
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h index a2def7b27009..8c834a31b56f 100644 --- a/include/uapi/drm/i915_drm.h +++ b/include/uapi/drm/i915_drm.h @@ -751,9 +751,16 @@ typedef struct drm_i915_irq_wait {
/* Must be kept compact -- no holes and well documented */
+/**
- typedef drm_i915_getparam_t - Driver parameter query structure.
This one looks funny in the rendered html for some reason, since it doesn't seem to emit the @param and @value, I guess it doesn't really understand typedef <struct> ?
Maybe make this "struct drm_i915_getparam - Driver parameter query structure." ?
Thanks Matt. Yah, there doesn't seem to be a good way to add kernel-doc for this kind of declaration. 'struct drm_i915_getparam' also didn't help. I was able to fix it by first defining the structure and then adding a typedef for it. Not sure if that has any value, but at least we can get kernel-doc for it.
- */
typedef struct drm_i915_getparam {
/** @param: Driver parameter to query. */ __s32 param;
/*
/**
* @value: Address of memory where queried value should be put.
* * WARNING: Using pointers instead of fixed-size u64 means we need to write * compat32 code. Don't repeat this mistake. */
@@ -1239,76 +1246,114 @@ struct drm_i915_gem_exec_object2 { __u64 rsvd2; };
+/**
- struct drm_i915_gem_exec_fence - An input or output fence for the execbuff
s/execbuff/execbuf/, at least that seems to be what we use elsewhere, AFAICT.
- ioctl.
- The request will wait for input fence to signal before submission.
- The returned output fence will be signaled after the completion of the
- request.
- */
struct drm_i915_gem_exec_fence {
/**
* User's handle for a drm_syncobj to wait on or signal.
*/
/** @handle: User's handle for a drm_syncobj to wait on or signal. */ __u32 handle;
/**
* @flags: Supported flags are,
are:
*
* I915_EXEC_FENCE_WAIT:
* Wait for the input fence before request submission.
*
* I915_EXEC_FENCE_SIGNAL:
* Return request completion fence as output
*/
__u32 flags;
#define I915_EXEC_FENCE_WAIT (1<<0) #define I915_EXEC_FENCE_SIGNAL (1<<1) #define __I915_EXEC_FENCE_UNKNOWN_FLAGS (-(I915_EXEC_FENCE_SIGNAL << 1))
__u32 flags;
};
-/*
- See drm_i915_gem_execbuffer_ext_timeline_fences.
- */
-#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
-/* +/**
- struct drm_i915_gem_execbuffer_ext_timeline_fences - Timeline fences
- for execbuff.
*/
- This structure describes an array of drm_syncobj and associated points for
- timeline variants of drm_syncobj. It is invalid to append this structure to
- the execbuf if I915_EXEC_FENCE_ARRAY is set.
struct drm_i915_gem_execbuffer_ext_timeline_fences { +#define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
/** @base: Extension link. See struct i915_user_extension. */ struct i915_user_extension base; /**
* Number of element in the handles_ptr & value_ptr arrays.
* @fence_count: Number of element in the @handles_ptr & @value_ptr
s/element/elements/
* arrays. */ __u64 fence_count; /**
* Pointer to an array of struct drm_i915_gem_exec_fence of length
* fence_count.
* @handles_ptr: Pointer to an array of struct drm_i915_gem_exec_fence
* of length @fence_count. */ __u64 handles_ptr; /**
* Pointer to an array of u64 values of length fence_count. Values
* must be 0 for a binary drm_syncobj. A Value of 0 for a timeline
* drm_syncobj is invalid as it turns a drm_syncobj into a binary one.
* @values_ptr: Pointer to an array of u64 values of length
* @fence_count.
* Values must be 0 for a binary drm_syncobj. A Value of 0 for a
* timeline drm_syncobj is invalid as it turns a drm_syncobj into a
* binary one. */ __u64 values_ptr;
};
+/**
- struct drm_i915_gem_execbuffer2 - Structure for execbuff submission
- */
struct drm_i915_gem_execbuffer2 {
/**
* List of gem_exec_object2 structs
*/
/** @buffers_ptr: Pointer to a list of gem_exec_object2 structs */ __u64 buffers_ptr;
/** @buffer_count: Number of elements in @buffers_ptr array */ __u32 buffer_count;
/** Offset in the batchbuffer to start execution from. */
/**
* @batch_start_offset: Offset in the batchbuffer to start execution
* from.
*/ __u32 batch_start_offset;
/** Bytes used in batchbuffer from batch_start_offset */
/** @batch_len: Bytes used in batchbuffer from batch_start_offset */
"Length in bytes of the batchbuffer, otherwise assumed to be the object size if zero, starting from the @batch_start_offset."
__u32 batch_len;
/** @DR1: deprecated */ __u32 DR1;
/** @DR4: deprecated */ __u32 DR4;
/** @num_cliprects: See @cliprects_ptr */ __u32 num_cliprects;
/**
* This is a struct drm_clip_rect *cliprects if I915_EXEC_FENCE_ARRAY
* & I915_EXEC_USE_EXTENSIONS are not set.
* @cliprects_ptr: Kernel clipping was a DRI1 misfeature.
*
* It is invalid to use this field if I915_EXEC_FENCE_ARRAY or
* I915_EXEC_USE_EXTENSIONS flags are not set. * * If I915_EXEC_FENCE_ARRAY is set, then this is a pointer to an array
* of struct drm_i915_gem_exec_fence and num_cliprects is the length
* of the array.
* of &drm_i915_gem_exec_fence and @num_cliprects is the length of the
* array. * * If I915_EXEC_USE_EXTENSIONS is set, then this is a pointer to a
* single struct i915_user_extension and num_cliprects is 0.
* single &i915_user_extension and num_cliprects is 0. */ __u64 cliprects_ptr;
/** @flags: Execbuff flags */
s/Execbuff/Execbuf/
Could maybe document the I915_EXEC_* also, or maybe not ;)
We no longer need to refer to execbuf2 as vm_bind will have its own new execbuf3. But will keep the already added execbuf2 documentation.
__u64 flags;
#define I915_EXEC_RING_MASK (0x3f) #define I915_EXEC_DEFAULT (0<<0) #define I915_EXEC_RENDER (1<<0) @@ -1326,10 +1371,6 @@ struct drm_i915_gem_execbuffer2 { #define I915_EXEC_CONSTANTS_REL_GENERAL (0<<6) /* default */ #define I915_EXEC_CONSTANTS_ABSOLUTE (1<<6) #define I915_EXEC_CONSTANTS_REL_SURFACE (2<<6) /* gen4/5 only */
__u64 flags;
__u64 rsvd1; /* now used for context info */
__u64 rsvd2;
-};
/** Resets the SO write offset registers for transform feedback on gen7. */ #define I915_EXEC_GEN7_SOL_RESET (1<<8) @@ -1432,9 +1473,23 @@ struct drm_i915_gem_execbuffer2 {
- drm_i915_gem_execbuffer_ext enum.
*/ #define I915_EXEC_USE_EXTENSIONS (1 << 21)
#define __I915_EXEC_UNKNOWN_FLAGS (-(I915_EXEC_USE_EXTENSIONS << 1))
/** @rsvd1: Context id */
__u64 rsvd1;
/**
* @rsvd2: in and out sync_file file descriptors.
*
* When I915_EXEC_FENCE_IN or I915_EXEC_FENCE_SUBMIT flag is set, the
* lower 32 bits of this field will have the in sync_file fd (input).
*
* When I915_EXEC_FENCE_OUT flag is set, the upper 32 bits of this
* field will have the out sync_file fd (output).
*/
__u64 rsvd2;
+};
#define I915_EXEC_CONTEXT_ID_MASK (0xffffffff) #define i915_execbuffer2_set_context_id(eb2, context) \ (eb2).rsvd1 = context & I915_EXEC_CONTEXT_ID_MASK @@ -1814,13 +1869,32 @@ struct drm_i915_gem_context_create { __u32 pad; };
+/**
- struct drm_i915_gem_context_create_ext - Structure for creating contexts.
- */
struct drm_i915_gem_context_create_ext {
__u32 ctx_id; /* output: id of new context*/
/** @ctx_id: Id of the created context (output) */
__u32 ctx_id;
/**
* @flags: Supported flags are,
are:
*
* I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS:
*
* Extensions may be appended to this structure and driver must check
* for those.
Maybe add "See @extensions.", and then....
*
* I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE
*
* Created context will have single timeline.
*/ __u32 flags;
#define I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS (1u << 0) #define I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE (1u << 1) #define I915_CONTEXT_CREATE_FLAGS_UNKNOWN \ (-(I915_CONTEXT_CREATE_FLAGS_SINGLE_TIMELINE << 1))
/** @extensions: Zero-terminated chain of extensions. */
...here perhaps list the extensions, and maybe also move the #define for each here? See for example @extensions in drm_i915_gem_create_ext.
Ok, will address all your comments above.
Niranjana
Reviewed-by: Matthew Auld matthew.auld@intel.com
__u64 extensions;
};
@@ -2387,7 +2461,9 @@ struct drm_i915_gem_context_destroy { __u32 pad; };
-/* +/**
- struct drm_i915_gem_vm_control - Structure to create or destroy VM.
- DRM_I915_GEM_VM_CREATE -
- Create a new virtual memory address space (ppGTT) for use within a context
@@ -2397,20 +2473,23 @@ struct drm_i915_gem_context_destroy {
- The id of new VM (bound to the fd) for use with I915_CONTEXT_PARAM_VM is
- returned in the outparam @id.
- No flags are defined, with all bits reserved and must be zero.
- An extension chain maybe provided, starting with @extensions, and terminated
- by the @next_extension being 0. Currently, no extensions are defined.
- DRM_I915_GEM_VM_DESTROY -
- Destroys a previously created VM id, specified in @id.
*/
- Destroys a previously created VM id, specified in @vm_id.
- No extensions or flags are allowed currently, and so must be zero.
struct drm_i915_gem_vm_control {
/** @extensions: Zero-terminated chain of extensions. */ __u64 extensions;
/** @flags: reserved for future usage, currently MBZ */ __u32 flags;
/** @vm_id: Id of the VM created or to be destroyed */ __u32 vm_id;
};
-- 2.21.0.rc0.32.g243a4c7e27
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com --- Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/* + * Copyright © 2022 Intel Corporation + */ + +/** + * DOC: I915_PARAM_HAS_VM_BIND + * + * VM_BIND feature availability. + * See typedef drm_i915_getparam_t param. + */ +#define I915_PARAM_HAS_VM_BIND 57 + +/** + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND + * + * Flag to opt-in for VM_BIND mode of binding during VM creation. + * See struct drm_i915_gem_vm_control flags. + * + * A VM in VM_BIND mode will not support the older execbuff mode of binding. + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the + * &drm_i915_gem_execbuffer2.buffer_count must be 0). + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and + * &drm_i915_gem_execbuffer2.batch_len must be 0. + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided + * to pass in the batch buffer addresses. + * + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0 + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses). + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0. + */ +#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0) + +/** + * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING + * + * Flag to declare context as long running. + * See struct drm_i915_gem_context_create_ext flags. + * + * Usage of dma-fence expects that they complete in reasonable amount of time. + * Compute on the other hand can be long running. Hence it is not appropriate + * for compute contexts to export request completion dma-fence to user. + * The dma-fence usage will be limited to in-kernel consumption only. + * Compute contexts need to use user/memory fence. + * + * So, long running contexts do not support output fences. Hence, + * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and + * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected + * to be not used. + * + * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped + * to long running contexts. + */ +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2) + +/* VM_BIND related ioctls */ +#define DRM_I915_GEM_VM_BIND 0x3d +#define DRM_I915_GEM_VM_UNBIND 0x3e +#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f + +#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence) + +/** + * struct drm_i915_gem_vm_bind - VA to object mapping to bind. + * + * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU + * virtual address (VA) range to the section of an object that should be bound + * in the device page table of the specified address space (VM). + * The VA range specified must be unique (ie., not currently bound) and can + * be mapped to whole object or a section of the object (partial binding). + * Multiple VA mappings can be created to the same section of the object + * (aliasing). 
+ */ +struct drm_i915_gem_vm_bind { + /** @vm_id: VM (address space) id to bind */ + __u32 vm_id; + + /** @handle: Object handle */ + __u32 handle; + + /** @start: Virtual Address start to bind */ + __u64 start; + + /** @offset: Offset in object to bind */ + __u64 offset; + + /** @length: Length of mapping to bind */ + __u64 length; + + /** + * @flags: Supported flags are, + * + * I915_GEM_VM_BIND_READONLY: + * Mapping is read-only. + * + * I915_GEM_VM_BIND_CAPTURE: + * Capture this mapping in the dump upon GPU error. + */ + __u64 flags; +#define I915_GEM_VM_BIND_READONLY (1 << 0) +#define I915_GEM_VM_BIND_CAPTURE (1 << 1) + + /** @extensions: 0-terminated chain of extensions for this mapping. */ + __u64 extensions; +}; + +/** + * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind. + * + * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual + * address (VA) range that should be unbound from the device page table of the + * specified address space (VM). The specified VA range must match one of the + * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind + * completion. + */ +struct drm_i915_gem_vm_unbind { + /** @vm_id: VM (address space) id to bind */ + __u32 vm_id; + + /** @rsvd: Reserved for future use; must be zero. */ + __u32 rsvd; + + /** @start: Virtual Address start to unbind */ + __u64 start; + + /** @length: Length of mapping to unbind */ + __u64 length; + + /** @flags: reserved for future usage, currently MBZ */ + __u64 flags; + + /** @extensions: 0-terminated chain of extensions for this mapping. */ + __u64 extensions; +}; + +/** + * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind + * or the vm_unbind work. + * + * The vm_bind or vm_unbind aync worker will wait for input fence to signal + * before starting the binding or unbinding. + * + * The vm_bind or vm_unbind async worker will signal the returned output fence + * after the completion of binding or unbinding. + */ +struct drm_i915_vm_bind_fence { + /** @handle: User's handle for a drm_syncobj to wait on or signal. */ + __u32 handle; + + /** + * @flags: Supported flags are, + * + * I915_VM_BIND_FENCE_WAIT: + * Wait for the input fence before binding/unbinding + * + * I915_VM_BIND_FENCE_SIGNAL: + * Return bind/unbind completion fence as output + */ + __u32 flags; +#define I915_VM_BIND_FENCE_WAIT (1<<0) +#define I915_VM_BIND_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1)) +}; + +/** + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind + * and vm_unbind. + * + * This structure describes an array of timeline drm_syncobj and associated + * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's + * can be input or output fences (See struct drm_i915_vm_bind_fence). + */ +struct drm_i915_vm_bind_ext_timeline_fences { +#define I915_VM_BIND_EXT_timeline_FENCES 0 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** + * @fence_count: Number of elements in the @handles_ptr & @value_ptr + * arrays. + */ + __u64 fence_count; + + /** + * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence + * of length @fence_count. + */ + __u64 handles_ptr; + + /** + * @values_ptr: Pointer to an array of u64 values of length + * @fence_count. + * Values must be 0 for a binary drm_syncobj. A Value of 0 for a + * timeline drm_syncobj is invalid as it turns a drm_syncobj into a + * binary one. 
+ */ + __u64 values_ptr; +}; + +/** + * struct drm_i915_vm_bind_user_fence - An input or output user fence for the + * vm_bind or the vm_unbind work. + * + * The vm_bind or vm_unbind aync worker will wait for the input fence (value at + * @addr to become equal to @val) before starting the binding or unbinding. + * + * The vm_bind or vm_unbind async worker will signal the output fence after + * the completion of binding or unbinding by writing @val to memory location at + * @addr + */ +struct drm_i915_vm_bind_user_fence { + /** @addr: User/Memory fence qword aligned process virtual address */ + __u64 addr; + + /** @val: User/Memory fence value to be written after bind completion */ + __u64 val; + + /** + * @flags: Supported flags are, + * + * I915_VM_BIND_USER_FENCE_WAIT: + * Wait for the input fence before binding/unbinding + * + * I915_VM_BIND_USER_FENCE_SIGNAL: + * Return bind/unbind completion fence as output + */ + __u32 flags; +#define I915_VM_BIND_USER_FENCE_WAIT (1<<0) +#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \ + (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1)) +}; + +/** + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind + * and vm_unbind. + * + * These user fences can be input or output fences + * (See struct drm_i915_vm_bind_user_fence). + */ +struct drm_i915_vm_bind_ext_user_fence { +#define I915_VM_BIND_EXT_USER_FENCES 1 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** @fence_count: Number of elements in the @user_fence_ptr array. */ + __u64 fence_count; + + /** + * @user_fence_ptr: Pointer to an array of + * struct drm_i915_vm_bind_user_fence of length @fence_count. + */ + __u64 user_fence_ptr; +}; + +/** + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer + * gpu virtual addresses. + * + * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension + * must always be appended in the VM_BIND mode and it will be an error to + * append this extension in older non-VM_BIND mode. + */ +struct drm_i915_gem_execbuffer_ext_batch_addresses { +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** @count: Number of addresses in the addr array. */ + __u32 count; + + /** @addr: An array of batch gpu virtual addresses. */ + __u64 addr[0]; +}; + +/** + * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion + * signaling extension. + * + * This extension allows user to attach a user fence (@addr, @value pair) to an + * execbuf to be signaled by the command streamer after the completion of first + * level batch, by writing the @value at specified @addr and triggering an + * interrupt. + * User can either poll for this user fence to signal or can also wait on it + * with i915_gem_wait_user_fence ioctl. + * This is very much usefaul for long running contexts where waiting on dma-fence + * by user (like i915_gem_wait ioctl) is not supported. + */ +struct drm_i915_gem_execbuffer_ext_user_fence { +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** + * @addr: User/Memory fence qword aligned GPU virtual address. + * + * Address has to be a valid GPU virtual address at the time of + * first level batch completion. 
+ */ + __u64 addr; + + /** + * @value: User/Memory fence Value to be written to above address + * after first level batch completes. + */ + __u64 value; + + /** @rsvd: Reserved for future extensions, MBZ */ + __u64 rsvd; +}; + +/** + * struct drm_i915_gem_create_ext_vm_private - Extension to make the object + * private to the specified VM. + * + * See struct drm_i915_gem_create_ext. + */ +struct drm_i915_gem_create_ext_vm_private { +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** @vm_id: Id of the VM to which the object is private */ + __u32 vm_id; +}; + +/** + * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence. + * + * User/Memory fence can be woken up either by: + * + * 1. GPU context indicated by @ctx_id, or, + * 2. Kerrnel driver async worker upon I915_UFENCE_WAIT_SOFT. + * @ctx_id is ignored when this flag is set. + * + * Wakeup condition is, + * ``((*addr & mask) op (value & mask))`` + * + * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>` + */ +struct drm_i915_gem_wait_user_fence { + /** @extensions: Zero-terminated chain of extensions. */ + __u64 extensions; + + /** @addr: User/Memory fence address */ + __u64 addr; + + /** @ctx_id: Id of the Context which will signal the fence. */ + __u32 ctx_id; + + /** @op: Wakeup condition operator */ + __u16 op; +#define I915_UFENCE_WAIT_EQ 0 +#define I915_UFENCE_WAIT_NEQ 1 +#define I915_UFENCE_WAIT_GT 2 +#define I915_UFENCE_WAIT_GTE 3 +#define I915_UFENCE_WAIT_LT 4 +#define I915_UFENCE_WAIT_LTE 5 +#define I915_UFENCE_WAIT_BEFORE 6 +#define I915_UFENCE_WAIT_AFTER 7 + + /** + * @flags: Supported flags are, + * + * I915_UFENCE_WAIT_SOFT: + * + * To be woken up by i915 driver async worker (not by GPU). + * + * I915_UFENCE_WAIT_ABSTIME: + * + * Wait timeout specified as absolute time. + */ + __u16 flags; +#define I915_UFENCE_WAIT_SOFT 0x1 +#define I915_UFENCE_WAIT_ABSTIME 0x2 + + /** @value: Wakeup value */ + __u64 value; + + /** @mask: Wakeup mask */ + __u64 mask; +#define I915_UFENCE_WAIT_U8 0xffu +#define I915_UFENCE_WAIT_U16 0xffffu +#define I915_UFENCE_WAIT_U32 0xfffffffful +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull + + /** + * @timeout: Wait timeout in nanoseconds. + * + * If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is the + * absolute time in nsec. + */ + __s64 timeout; +};
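To tie the pieces of this header together, here is a condensed userspace
sketch of the intended flow: create a VM in VM_BIND mode, bind a BO at a user
chosen GPU virtual address, and submit with the batch address extension. This
is only a sketch against the RFC definitions above (which are not yet real
uapi); the context/VM association via I915_CONTEXT_PARAM_VM, BO creation,
waiting for the asynchronous bind to complete, and error handling are all
elided, and the handles/addresses are placeholders.

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>
/* Plus the definitions proposed above (Documentation/gpu/rfc/i915_vm_bind.h). */

/* Illustrative only; handles, addresses and sizes are placeholders. */
static int vm_bind_submit_example(int drm_fd, uint32_t ctx_id,
                                  uint32_t bo_handle, uint64_t bo_size,
                                  uint64_t batch_va)
{
        struct drm_i915_gem_vm_control vm_create = {
                .flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,
        };
        struct drm_i915_gem_vm_bind bind = {};
        struct drm_i915_gem_execbuffer_ext_batch_addresses *ext;
        struct drm_i915_gem_execbuffer2 eb = {};
        int ret;

        /* Opt the new address space in to VM_BIND mode at creation time. */
        if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm_create))
                return -1;
        /* Associating vm_create.vm_id with ctx_id via I915_CONTEXT_PARAM_VM
         * is elided here. */

        /* Bind the whole BO at a user managed GPU virtual address. The bind
         * is asynchronous; syncing via the in/out fence extensions is elided. */
        bind.vm_id = vm_create.vm_id;
        bind.handle = bo_handle;
        bind.start = batch_va;
        bind.offset = 0;
        bind.length = bo_size;
        if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind))
                return -1;

        /* No execlist in VM_BIND mode; the batch address is passed via the
         * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension. */
        ext = calloc(1, sizeof(*ext) + sizeof(uint64_t));
        if (!ext)
                return -1;
        ext->base.name = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES;
        ext->count = 1;
        ext->addr[0] = batch_va;

        /* buffers_ptr, buffer_count, batch_start_offset and batch_len stay 0. */
        eb.flags = I915_EXEC_USE_EXTENSIONS;
        eb.cliprects_ptr = (uintptr_t)ext;
        eb.rsvd1 = ctx_id;

        ret = ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &eb);
        free(ext);
        return ret;
}

Compared to the legacy path, note that no execlist or relocation information
is passed at submission time; residency is entirely determined by the earlier
vm_bind call.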
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 {
        __u64 buffers_ptr;              -> must be 0 (new)
        __u32 buffer_count;             -> must be 0 (new)
        __u32 batch_start_offset;       -> must be 0 (new)
        __u32 batch_len;                -> must be 0 (new)
        __u32 DR1;                      -> must be 0 (old)
        __u32 DR4;                      -> must be 0 (old)
        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
        __u64 flags;                    -> some flags must be 0 (new)
        __u64 rsvd1; (context info)     -> repurposed field (old)
        __u64 rsvd2;                    -> unused
};
Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how the VM was created (which is an entirely different ioctl).
From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0) + +/** + * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING + * + * Flag to declare context as long running. + * See struct drm_i915_gem_context_create_ext flags. + * + * Usage of dma-fence expects that they complete in reasonable amount of time. + * Compute on the other hand can be long running. Hence it is not appropriate + * for compute contexts to export request completion dma-fence to user. + * The dma-fence usage will be limited to in-kernel consumption only. + * Compute contexts need to use user/memory fence. + * + * So, long running contexts do not support output fences. Hence, + * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and + * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected + * to be not used. + * + * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped + * to long running contexts. + */ +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2) + +/* VM_BIND related ioctls */ +#define DRM_I915_GEM_VM_BIND 0x3d +#define DRM_I915_GEM_VM_UNBIND 0x3e +#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f + +#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence) + +/** + * struct drm_i915_gem_vm_bind - VA to object mapping to bind. + * + * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU + * virtual address (VA) range to the section of an object that should be bound + * in the device page table of the specified address space (VM). + * The VA range specified must be unique (ie., not currently bound) and can + * be mapped to whole object or a section of the object (partial binding). + * Multiple VA mappings can be created to the same section of the object + * (aliasing). + */ +struct drm_i915_gem_vm_bind { + /** @vm_id: VM (address space) id to bind */ + __u32 vm_id; + + /** @handle: Object handle */ + __u32 handle; + + /** @start: Virtual Address start to bind */ + __u64 start; + + /** @offset: Offset in object to bind */ + __u64 offset; + + /** @length: Length of mapping to bind */ + __u64 length; + + /** + * @flags: Supported flags are, + * + * I915_GEM_VM_BIND_READONLY: + * Mapping is read-only. + * + * I915_GEM_VM_BIND_CAPTURE: + * Capture this mapping in the dump upon GPU error. + */ + __u64 flags; +#define I915_GEM_VM_BIND_READONLY (1 << 0) +#define I915_GEM_VM_BIND_CAPTURE (1 << 1) + + /** @extensions: 0-terminated chain of extensions for this mapping. */ + __u64 extensions; +}; + +/** + * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind. + * + * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual + * address (VA) range that should be unbound from the device page table of the + * specified address space (VM). The specified VA range must match one of the + * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind + * completion. + */ +struct drm_i915_gem_vm_unbind { + /** @vm_id: VM (address space) id to bind */ + __u32 vm_id; + + /** @rsvd: Reserved for future use; must be zero. 
*/ + __u32 rsvd; + + /** @start: Virtual Address start to unbind */ + __u64 start; + + /** @length: Length of mapping to unbind */ + __u64 length; + + /** @flags: reserved for future usage, currently MBZ */ + __u64 flags; + + /** @extensions: 0-terminated chain of extensions for this mapping. */ + __u64 extensions; +}; + +/** + * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind + * or the vm_unbind work. + * + * The vm_bind or vm_unbind aync worker will wait for input fence to signal + * before starting the binding or unbinding. + * + * The vm_bind or vm_unbind async worker will signal the returned output fence + * after the completion of binding or unbinding. + */ +struct drm_i915_vm_bind_fence { + /** @handle: User's handle for a drm_syncobj to wait on or signal. */ + __u32 handle; + + /** + * @flags: Supported flags are, + * + * I915_VM_BIND_FENCE_WAIT: + * Wait for the input fence before binding/unbinding + * + * I915_VM_BIND_FENCE_SIGNAL: + * Return bind/unbind completion fence as output + */ + __u32 flags; +#define I915_VM_BIND_FENCE_WAIT (1<<0) +#define I915_VM_BIND_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1)) +}; + +/** + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind + * and vm_unbind. + * + * This structure describes an array of timeline drm_syncobj and associated + * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's + * can be input or output fences (See struct drm_i915_vm_bind_fence). + */ +struct drm_i915_vm_bind_ext_timeline_fences { +#define I915_VM_BIND_EXT_timeline_FENCES 0 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** + * @fence_count: Number of elements in the @handles_ptr & @value_ptr + * arrays. + */ + __u64 fence_count; + + /** + * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence + * of length @fence_count. + */ + __u64 handles_ptr; + + /** + * @values_ptr: Pointer to an array of u64 values of length + * @fence_count. + * Values must be 0 for a binary drm_syncobj. A Value of 0 for a + * timeline drm_syncobj is invalid as it turns a drm_syncobj into a + * binary one. + */ + __u64 values_ptr; +}; + +/** + * struct drm_i915_vm_bind_user_fence - An input or output user fence for the + * vm_bind or the vm_unbind work. + * + * The vm_bind or vm_unbind aync worker will wait for the input fence (value at + * @addr to become equal to @val) before starting the binding or unbinding. + * + * The vm_bind or vm_unbind async worker will signal the output fence after + * the completion of binding or unbinding by writing @val to memory location at + * @addr + */ +struct drm_i915_vm_bind_user_fence { + /** @addr: User/Memory fence qword aligned process virtual address */ + __u64 addr; + + /** @val: User/Memory fence value to be written after bind completion */ + __u64 val; + + /** + * @flags: Supported flags are, + * + * I915_VM_BIND_USER_FENCE_WAIT: + * Wait for the input fence before binding/unbinding + * + * I915_VM_BIND_USER_FENCE_SIGNAL: + * Return bind/unbind completion fence as output + */ + __u32 flags; +#define I915_VM_BIND_USER_FENCE_WAIT (1<<0) +#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \ + (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1)) +}; + +/** + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind + * and vm_unbind. 
+ * + * These user fences can be input or output fences + * (See struct drm_i915_vm_bind_user_fence). + */ +struct drm_i915_vm_bind_ext_user_fence { +#define I915_VM_BIND_EXT_USER_FENCES 1 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** @fence_count: Number of elements in the @user_fence_ptr array. */ + __u64 fence_count; + + /** + * @user_fence_ptr: Pointer to an array of + * struct drm_i915_vm_bind_user_fence of length @fence_count. + */ + __u64 user_fence_ptr; +}; + +/** + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer + * gpu virtual addresses. + * + * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension + * must always be appended in the VM_BIND mode and it will be an error to + * append this extension in older non-VM_BIND mode. + */ +struct drm_i915_gem_execbuffer_ext_batch_addresses { +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** @count: Number of addresses in the addr array. */ + __u32 count; + + /** @addr: An array of batch gpu virtual addresses. */ + __u64 addr[0]; +}; + +/** + * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion + * signaling extension. + * + * This extension allows user to attach a user fence (@addr, @value pair) to an + * execbuf to be signaled by the command streamer after the completion of first + * level batch, by writing the @value at specified @addr and triggering an + * interrupt. + * User can either poll for this user fence to signal or can also wait on it + * with i915_gem_wait_user_fence ioctl. + * This is very much usefaul for long running contexts where waiting on dma-fence + * by user (like i915_gem_wait ioctl) is not supported. + */ +struct drm_i915_gem_execbuffer_ext_user_fence { +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** + * @addr: User/Memory fence qword aligned GPU virtual address. + * + * Address has to be a valid GPU virtual address at the time of + * first level batch completion. + */ + __u64 addr; + + /** + * @value: User/Memory fence Value to be written to above address + * after first level batch completes. + */ + __u64 value; + + /** @rsvd: Reserved for future extensions, MBZ */ + __u64 rsvd; +}; + +/** + * struct drm_i915_gem_create_ext_vm_private - Extension to make the object + * private to the specified VM. + * + * See struct drm_i915_gem_create_ext. + */ +struct drm_i915_gem_create_ext_vm_private { +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base; + + /** @vm_id: Id of the VM to which the object is private */ + __u32 vm_id; +}; + +/** + * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence. + * + * User/Memory fence can be woken up either by: + * + * 1. GPU context indicated by @ctx_id, or, + * 2. Kerrnel driver async worker upon I915_UFENCE_WAIT_SOFT. + * @ctx_id is ignored when this flag is set. + * + * Wakeup condition is, + * ``((*addr & mask) op (value & mask))`` + * + * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>` + */ +struct drm_i915_gem_wait_user_fence { + /** @extensions: Zero-terminated chain of extensions. 
*/ + __u64 extensions; + + /** @addr: User/Memory fence address */ + __u64 addr; + + /** @ctx_id: Id of the Context which will signal the fence. */ + __u32 ctx_id; + + /** @op: Wakeup condition operator */ + __u16 op; +#define I915_UFENCE_WAIT_EQ 0 +#define I915_UFENCE_WAIT_NEQ 1 +#define I915_UFENCE_WAIT_GT 2 +#define I915_UFENCE_WAIT_GTE 3 +#define I915_UFENCE_WAIT_LT 4 +#define I915_UFENCE_WAIT_LTE 5 +#define I915_UFENCE_WAIT_BEFORE 6 +#define I915_UFENCE_WAIT_AFTER 7 + + /** + * @flags: Supported flags are, + * + * I915_UFENCE_WAIT_SOFT: + * + * To be woken up by i915 driver async worker (not by GPU). + * + * I915_UFENCE_WAIT_ABSTIME: + * + * Wait timeout specified as absolute time. + */ + __u16 flags; +#define I915_UFENCE_WAIT_SOFT 0x1 +#define I915_UFENCE_WAIT_ABSTIME 0x2 + + /** @value: Wakeup value */ + __u64 value; + + /** @mask: Wakeup mask */ + __u64 mask; +#define I915_UFENCE_WAIT_U8 0xffu +#define I915_UFENCE_WAIT_U16 0xffffu +#define I915_UFENCE_WAIT_U32 0xfffffffful +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull + + /** + * @timeout: Wait timeout in nanoseconds. + * + * If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is the + * absolute time in nsec. + */ + __s64 timeout; +};
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 {
        __u64 buffers_ptr;          -> must be 0 (new)
        __u32 buffer_count;         -> must be 0 (new)
        __u32 batch_start_offset;   -> must be 0 (new)
        __u32 batch_len;            -> must be 0 (new)
        __u32 DR1;                  -> must be 0 (old)
        __u32 DR4;                  -> must be 0 (old)
        __u32 num_cliprects; (fences)              -> must be 0 since using extensions
        __u64 cliprects_ptr; (fences, extensions)  -> contains an actual pointer!
        __u64 flags;                -> some flags must be 0 (new)
        __u64 rsvd1; (context info) -> repurposed field (old)
        __u64 rsvd2;                -> unused
};
Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how the VM was created (which is an entirely different ioctl).
From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.
The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Niranjana
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0)
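For reference, opting a VM into this mode would be a plain GEM_VM_CREATE call with the flag set. A minimal sketch, using the existing drm_i915_gem_vm_control uapi plus the flag value proposed here (drm_fd and vm_id are placeholders for the opened DRM device fd and wherever the caller stores the new VM id):

        struct drm_i915_gem_vm_control vm_create = {
                .flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,
        };

        if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_CREATE, &vm_create) == 0)
                vm_id = vm_create.vm_id;  /* address space now in VM_BIND mode */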
+/**
- DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
- Flag to declare context as long running.
- See struct drm_i915_gem_context_create_ext flags.
- Usage of dma-fence expects that they complete in reasonable amount of time.
- Compute on the other hand can be long running. Hence it is not appropriate
- for compute contexts to export request completion dma-fence to user.
- The dma-fence usage will be limited to in-kernel consumption only.
- Compute contexts need to use user/memory fence.
- So, long running contexts do not support output fences. Hence,
- I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
- I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
- to be not used.
- DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
- to long running contexts.
- */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
+/* VM_BIND related ioctls */ +#define DRM_I915_GEM_VM_BIND 0x3d +#define DRM_I915_GEM_VM_UNBIND 0x3e +#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f
+#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+/**
- struct drm_i915_gem_vm_bind - VA to object mapping to bind.
- This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
- virtual address (VA) range to the section of an object that should be bound
- in the device page table of the specified address space (VM).
- The VA range specified must be unique (ie., not currently bound) and can
- be mapped to whole object or a section of the object (partial binding).
- Multiple VA mappings can be created to the same section of the object
- (aliasing).
- */
+struct drm_i915_gem_vm_bind {
/** @vm_id: VM (address space) id to bind */
__u32 vm_id;
/** @handle: Object handle */
__u32 handle;
/** @start: Virtual Address start to bind */
__u64 start;
/** @offset: Offset in object to bind */
__u64 offset;
/** @length: Length of mapping to bind */
__u64 length;
/**
* @flags: Supported flags are,
*
* I915_GEM_VM_BIND_READONLY:
* Mapping is read-only.
*
* I915_GEM_VM_BIND_CAPTURE:
* Capture this mapping in the dump upon GPU error.
*/
__u64 flags;
+#define I915_GEM_VM_BIND_READONLY (1 << 0) +#define I915_GEM_VM_BIND_CAPTURE (1 << 1)
/** @extensions: 0-terminated chain of extensions for this mapping. */
__u64 extensions;
+};
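To make the intended flow concrete, below is a minimal userspace sketch of a bind call against the uapi proposed above. It is illustrative only; it assumes <stdint.h>, <string.h>, <sys/ioctl.h> and this RFC header on top of the usual drm/i915_drm.h, and none of it is in the released i915 uapi yet:

        static int bind_bo(int drm_fd, uint32_t vm_id, uint32_t bo_handle,
                           uint64_t gpu_va, uint64_t bo_offset, uint64_t size)
        {
                struct drm_i915_gem_vm_bind bind;

                memset(&bind, 0, sizeof(bind));
                bind.vm_id = vm_id;       /* VM created with I915_VM_CREATE_FLAGS_USE_VM_BIND */
                bind.handle = bo_handle;  /* GEM object to map */
                bind.start = gpu_va;      /* user managed GPU virtual address */
                bind.offset = bo_offset;  /* non-zero offset gives a partial binding */
                bind.length = size;
                bind.flags = 0;           /* e.g. I915_GEM_VM_BIND_READONLY */
                bind.extensions = 0;      /* no fences chained in this sketch */

                return ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);
        }

With no fence extension chained, the bind is still queued to the async worker; a caller that needs to know when the mapping is ready would chain one of the fence extensions shown further below.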
+/**
- struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
- This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
- address (VA) range that should be unbound from the device page table of the
- specified address space (VM). The specified VA range must match one of the
- mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
- completion.
- */
+struct drm_i915_gem_vm_unbind {
/** @vm_id: VM (address space) id to bind */
__u32 vm_id;
/** @rsvd: Reserved for future use; must be zero. */
__u32 rsvd;
/** @start: Virtual Address start to unbind */
__u64 start;
/** @length: Length of mapping to unbind */
__u64 length;
/** @flags: reserved for future usage, currently MBZ */
__u64 flags;
/** @extensions: 0-terminated chain of extensions for this mapping. */
__u64 extensions;
+};
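The matching unbind, continuing the same sketch (the VA range must exactly match an earlier bind):

        static int unbind_va(int drm_fd, uint32_t vm_id, uint64_t gpu_va, uint64_t size)
        {
                struct drm_i915_gem_vm_unbind unbind;

                memset(&unbind, 0, sizeof(unbind));
                unbind.vm_id = vm_id;
                unbind.start = gpu_va;    /* must match the range passed to VM_BIND */
                unbind.length = size;

                return ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_UNBIND, &unbind);
        }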
+/**
- struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
- or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence to signal
- before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the returned output fence
- after the completion of binding or unbinding.
- */
+struct drm_i915_vm_bind_fence {
/** @handle: User's handle for a drm_syncobj to wait on or signal. */
__u32 handle;
/**
* @flags: Supported flags are,
*
* I915_VM_BIND_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
__u32 flags;
+#define I915_VM_BIND_FENCE_WAIT (1<<0) +#define I915_VM_BIND_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1)) +};
+/**
- struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
- and vm_unbind.
- This structure describes an array of timeline drm_syncobj and associated
- points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
- can be input or output fences (See struct drm_i915_vm_bind_fence).
- */
+struct drm_i915_vm_bind_ext_timeline_fences { +#define I915_VM_BIND_EXT_timeline_FENCES 0
/** @base: Extension link. See struct i915_user_extension. */
struct i915_user_extension base;
/**
* @fence_count: Number of elements in the @handles_ptr & @value_ptr
* arrays.
*/
__u64 fence_count;
/**
* @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
* of length @fence_count.
*/
__u64 handles_ptr;
/**
* @values_ptr: Pointer to an array of u64 values of length
* @fence_count.
* Values must be 0 for a binary drm_syncobj. A Value of 0 for a
* timeline drm_syncobj is invalid as it turns a drm_syncobj into a
* binary one.
*/
__u64 values_ptr;
+};
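As an illustration of how the extension chaining is expected to work, a hedged sketch of a bind that waits on one timeline point and signals another (the caller fills in the syncobj handles; same includes as the earlier sketches):

        static void chain_bind_timeline_fences(struct drm_i915_gem_vm_bind *bind,
                                               struct drm_i915_vm_bind_ext_timeline_fences *ext,
                                               struct drm_i915_vm_bind_fence fences[2],
                                               uint64_t points[2])
        {
                /* fences[0]/points[0]: wait for this point before binding,
                 * fences[1]/points[1]: signal this point on bind completion. */
                fences[0].flags = I915_VM_BIND_FENCE_WAIT;
                fences[1].flags = I915_VM_BIND_FENCE_SIGNAL;

                ext->base.name = I915_VM_BIND_EXT_timeline_FENCES;
                ext->fence_count = 2;
                ext->handles_ptr = (uintptr_t)fences;  /* .handle fields set by the caller */
                ext->values_ptr = (uintptr_t)points;   /* 0 only for binary syncobjs */

                bind->extensions = (uintptr_t)ext;
        }

All of ext, fences and points have to stay alive until the VM_BIND ioctl has actually been called on bind.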
+/**
- struct drm_i915_vm_bind_user_fence - An input or output user fence for the
- vm_bind or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence (value at
- @addr to become equal to @val) before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the output fence after
- the completion of binding or unbinding by writing @val to the memory location
- at @addr.
- */
+struct drm_i915_vm_bind_user_fence {
/** @addr: User/Memory fence qword aligned process virtual address */
__u64 addr;
/** @val: User/Memory fence value to be written after bind completion */
__u64 val;
/**
* @flags: Supported flags are,
*
* I915_VM_BIND_USER_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_USER_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
__u32 flags;
+#define I915_VM_BIND_USER_FENCE_WAIT (1<<0) +#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
(-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
+};
+/**
- struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
- and vm_unbind.
- These user fences can be input or output fences
- (See struct drm_i915_vm_bind_user_fence).
- */
+struct drm_i915_vm_bind_ext_user_fence { +#define I915_VM_BIND_EXT_USER_FENCES 1
/** @base: Extension link. See struct i915_user_extension. */
struct i915_user_extension base;
/** @fence_count: Number of elements in the @user_fence_ptr array. */
__u64 fence_count;
/**
* @user_fence_ptr: Pointer to an array of
* struct drm_i915_vm_bind_user_fence of length @fence_count.
*/
__u64 user_fence_ptr;
+};
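For compute (long running) contexts, the same chaining applies but with user/memory fences instead of syncobjs. A sketch under the same assumptions, signalling *fence_addr = done_value once the bind has completed:

        static void chain_bind_user_fence(struct drm_i915_gem_vm_bind *bind,
                                          struct drm_i915_vm_bind_ext_user_fence *ext,
                                          struct drm_i915_vm_bind_user_fence *fence,
                                          uint64_t *fence_addr, uint64_t done_value)
        {
                fence->addr = (uintptr_t)fence_addr;  /* qword aligned process VA */
                fence->val = done_value;
                fence->flags = I915_VM_BIND_USER_FENCE_SIGNAL;

                ext->base.name = I915_VM_BIND_EXT_USER_FENCES;
                ext->fence_count = 1;
                ext->user_fence_ptr = (uintptr_t)fence;

                bind->extensions = (uintptr_t)ext;
        }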
+/**
- struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
- gpu virtual addresses.
- In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
- must always be appended in the VM_BIND mode and it will be an error to
- append this extension in older non-VM_BIND mode.
- */
+struct drm_i915_gem_execbuffer_ext_batch_addresses { +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1
/** @base: Extension link. See struct i915_user_extension. */
struct i915_user_extension base;
/** @count: Number of addresses in the addr array. */
__u32 count;
/** @addr: An array of batch gpu virtual addresses. */
__u64 addr[0];
+};
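Putting the pieces together, a VM_BIND mode submission through the existing execbuf2 ioctl would look roughly like the sketch below. It follows the rules in the I915_VM_CREATE_FLAGS_USE_VM_BIND documentation above (extension chain passed via cliprects_ptr with I915_EXEC_USE_EXTENSIONS, everything legacy left at zero); note the discussion further down about whether a dedicated execbuf3 ioctl should replace this:

        static int submit_vm_bind(int drm_fd, uint32_t ctx_id, uint64_t batch_gpu_va)
        {
                struct drm_i915_gem_execbuffer2 execbuf;
                struct drm_i915_gem_execbuffer_ext_batch_addresses *ext;
                int ret;

                ext = calloc(1, sizeof(*ext) + sizeof(uint64_t)); /* room for one address */
                if (!ext)
                        return -1;
                ext->base.name = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES;
                ext->count = 1;
                ext->addr[0] = batch_gpu_va;

                memset(&execbuf, 0, sizeof(execbuf));
                execbuf.rsvd1 = ctx_id;                  /* context id, as with execbuf2 today */
                execbuf.flags = I915_EXEC_USE_EXTENSIONS;
                execbuf.cliprects_ptr = (uintptr_t)ext;  /* extension chain */
                /* buffers_ptr, buffer_count, batch_start_offset, batch_len stay 0 */

                ret = ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
                free(ext);
                return ret;
        }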
+/**
- struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
- signaling extension.
- This extension allows user to attach a user fence (@addr, @value pair) to an
- execbuf to be signaled by the command streamer after the completion of first
- level batch, by writing the @value at specified @addr and triggering an
- interrupt.
- User can either poll for this user fence to signal or can also wait on it
- with i915_gem_wait_user_fence ioctl.
- This is very useful for long running contexts where waiting on dma-fence
- by user (like i915_gem_wait ioctl) is not supported.
- */
+struct drm_i915_gem_execbuffer_ext_user_fence { +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2
/** @base: Extension link. See struct i915_user_extension. */
struct i915_user_extension base;
/**
* @addr: User/Memory fence qword aligned GPU virtual address.
*
* Address has to be a valid GPU virtual address at the time of
* first level batch completion.
*/
__u64 addr;
/**
* @value: User/Memory fence value to be written to the above address
* after first level batch completes.
*/
__u64 value;
/** @rsvd: Reserved for future extensions, MBZ */
__u64 rsvd;
+};
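A sketch of attaching that completion fence to a submission by chaining it behind another extension (for instance the batch-addresses extension from the earlier sketch); fence_gpu_va and seqno are placeholders:

        static void attach_batch_done_fence(struct i915_user_extension *chain_tail,
                                            struct drm_i915_gem_execbuffer_ext_user_fence *done,
                                            uint64_t fence_gpu_va, uint64_t seqno)
        {
                /* The CS writes seqno to fence_gpu_va and raises an interrupt once the
                 * first level batch completes; that VA must still be bound at that point. */
                done->base.name = DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE;
                done->addr = fence_gpu_va;
                done->value = seqno;

                chain_tail->next_extension = (uintptr_t)done;
        }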
+/**
- struct drm_i915_gem_create_ext_vm_private - Extension to make the object
- private to the specified VM.
- See struct drm_i915_gem_create_ext.
- */
+struct drm_i915_gem_create_ext_vm_private { +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2
/** @base: Extension link. See struct i915_user_extension. */
struct i915_user_extension base;
/** @vm_id: Id of the VM to which the object is private */
__u32 vm_id;
+};
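A sketch of creating such a VM-private object, combining the upstream gem_create_ext uapi with the extension proposed here:

        static int create_vm_private_bo(int drm_fd, uint32_t vm_id, uint64_t size,
                                        uint32_t *handle_out)
        {
                struct drm_i915_gem_create_ext_vm_private vm_priv = {
                        .base.name = I915_GEM_CREATE_EXT_VM_PRIVATE,
                        .vm_id = vm_id,
                };
                struct drm_i915_gem_create_ext create = {
                        .size = size,
                        .extensions = (uintptr_t)&vm_priv,
                };
                int ret = ioctl(drm_fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);

                if (ret == 0)
                        *handle_out = create.handle;
                return ret;
        }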
+/**
- struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
- User/Memory fence can be woken up either by:
- GPU context indicated by @ctx_id, or,
- Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
- @ctx_id is ignored when this flag is set.
- Wakeup condition is,
- ``((*addr & mask) op (value & mask))``
- See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
- */
+struct drm_i915_gem_wait_user_fence {
/** @extensions: Zero-terminated chain of extensions. */
__u64 extensions;
/** @addr: User/Memory fence address */
__u64 addr;
/** @ctx_id: Id of the Context which will signal the fence. */
__u32 ctx_id;
/** @op: Wakeup condition operator */
__u16 op;
+#define I915_UFENCE_WAIT_EQ 0 +#define I915_UFENCE_WAIT_NEQ 1 +#define I915_UFENCE_WAIT_GT 2 +#define I915_UFENCE_WAIT_GTE 3 +#define I915_UFENCE_WAIT_LT 4 +#define I915_UFENCE_WAIT_LTE 5 +#define I915_UFENCE_WAIT_BEFORE 6 +#define I915_UFENCE_WAIT_AFTER 7
/**
* @flags: Supported flags are,
*
* I915_UFENCE_WAIT_SOFT:
*
* To be woken up by i915 driver async worker (not by GPU).
*
* I915_UFENCE_WAIT_ABSTIME:
*
* Wait timeout specified as absolute time.
*/
__u16 flags;
+#define I915_UFENCE_WAIT_SOFT 0x1 +#define I915_UFENCE_WAIT_ABSTIME 0x2
/** @value: Wakeup value */
__u64 value;
/** @mask: Wakeup mask */
__u64 mask;
+#define I915_UFENCE_WAIT_U8 0xffu +#define I915_UFENCE_WAIT_U16 0xffffu +#define I915_UFENCE_WAIT_U32 0xfffffffful +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull
/**
* @timeout: Wait timeout in nanoseconds.
*
* If I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is the
* absolute time in nsec.
*/
__s64 timeout;
+};
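For completeness, a hedged sketch of a blocking wait on such a user fence, assuming this ioctl lands as proposed; the address/value pair is the same one attached through the signal extensions above:

        static int wait_user_fence(int drm_fd, uint32_t ctx_id,
                                   uint64_t *fence_addr, uint64_t wait_value,
                                   int64_t timeout_ns)
        {
                struct drm_i915_gem_wait_user_fence wait = {
                        .addr = (uintptr_t)fence_addr,
                        .ctx_id = ctx_id,            /* ignored if I915_UFENCE_WAIT_SOFT is set */
                        .op = I915_UFENCE_WAIT_GTE,  /* wake when (*addr & mask) >= (value & mask) */
                        .value = wait_value,
                        .mask = I915_UFENCE_WAIT_U64,
                        .timeout = timeout_ns,       /* relative, as ABSTIME is not set */
                };

                return ioctl(drm_fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);
        }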
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 { __u64 buffers_ptr; -> must be 0 (new) __u32 buffer_count; -> must be 0 (new) __u32 batch_start_offset; -> must be 0 (new) __u32 batch_len; -> must be 0 (new) __u32 DR1; -> must be 0 (old) __u32 DR4; -> must be 0 (old) __u32 num_cliprects; (fences) -> must be 0 since using extensions __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer! __u64 flags; -> some flags must be 0 (new) __u64 rsvd1; (context info) -> repurposed field (old) __u64 rsvd2; -> unused };
Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how the VM was created (which is an entirely different ioctl).
From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.
The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
Dave.
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 { __u64 buffers_ptr; -> must be 0 (new) __u32 buffer_count; -> must be 0 (new) __u32 batch_start_offset; -> must be 0 (new) __u32 batch_len; -> must be 0 (new) __u32 DR1; -> must be 0 (old) __u32 DR4; -> must be 0 (old) __u32 num_cliprects; (fences) -> must be 0 since using extensions __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer! __u64 flags; -> some flags must be 0 (new) __u64 rsvd1; (context info) -> repurposed field (old) __u64 rsvd2; -> unused };
Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how the VM was created (which is an entirely different ioctl).
From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.
The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this? -Daniel
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 { __u64 buffers_ptr; -> must be 0 (new) __u32 buffer_count; -> must be 0 (new) __u32 batch_start_offset; -> must be 0 (new) __u32 batch_len; -> must be 0 (new) __u32 DR1; -> must be 0 (old) __u32 DR4; -> must be 0 (old) __u32 num_cliprects; (fences) -> must be 0 since using extensions __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer! __u64 flags; -> some flags must be 0 (new) __u64 rsvd1; (context info) -> repurposed field (old) __u64 rsvd2; -> unused };
Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how the VM was created (which is an entirely different ioctl).
From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.
The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 as well (but the bit positions will differ). But I guess these should be fine, as the suggestion here is to copy-paste the execbuf code and have shared code where possible. Besides, we can stop supporting some older features in execbuf3 (like the fence array, in favor of the newer timeline fences), which will further reduce the common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Niranjana
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 { __u64 buffers_ptr; -> must be 0 (new) __u32 buffer_count; -> must be 0 (new) __u32 batch_start_offset; -> must be 0 (new) __u32 batch_len; -> must be 0 (new) __u32 DR1; -> must be 0 (old) __u32 DR4; -> must be 0 (old) __u32 num_cliprects; (fences) -> must be 0 since using extensions __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer! __u64 flags; -> some flags must be 0 (new) __u64 rsvd1; (context info) -> repurposed field (old) __u64 rsvd2; -> unused };
Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how the VM was created (which is an entirely different ioctl).
From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.
The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;           /* previously execbuffer2.rsvd1 */

        __u32 batch_count;
        __u64 batch_addr_ptr;   /* Pointer to an array of batch gpu virtual addresses */

        __u64 flags;
#define I915_EXEC3_RING_MASK            (0x3f)
#define I915_EXEC3_DEFAULT              (0<<0)
#define I915_EXEC3_RENDER               (1<<0)
#define I915_EXEC3_BSD                  (2<<0)
#define I915_EXEC3_BLT                  (3<<0)
#define I915_EXEC3_VEBOX                (4<<0)

#define I915_EXEC3_SECURE               (1<<6)
#define I915_EXEC3_IS_PINNED            (1<<7)

#define I915_EXEC3_BSD_SHIFT            (8)
#define I915_EXEC3_BSD_MASK             (3 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_DEFAULT          (0 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING1            (1 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING2            (2 << I915_EXEC3_BSD_SHIFT)

#define I915_EXEC3_FENCE_IN             (1<<10)
#define I915_EXEC3_FENCE_OUT            (1<<11)
#define I915_EXEC3_FENCE_SUBMIT         (1<<12)

        __u64 in_out_fence;     /* previously execbuffer2.rsvd2 */

        __u64 extensions;       /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
};
With this, user can pass in batch addresses and count directly, instead of as an extension (as this rfc series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Anything else that needs to be added or removed?
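Purely to help picture the proposal (nothing here exists yet; the ioctl number, the remaining flags and the fence fields are exactly what is being debated here), filling the proposed struct for a single batch might look like:

        struct drm_i915_gem_execbuffer3 eb3 = {
                .ctx_id = ctx_id,                        /* context with an engine map */
                .batch_count = 1,
                .batch_addr_ptr = (uintptr_t)&batch_va,  /* array of GPU VAs, here just one */
                .extensions = (uintptr_t)&timeline_fences_ext,
        };

        /* ret = ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER3, &eb3); -- hypothetical */

ctx_id, batch_va and timeline_fences_ext are placeholders, and DRM_IOCTL_I915_GEM_EXECBUFFER3 is purely hypothetical at this point.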
Niranjana
Niranjana
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On 03/06/2022 07:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote: > VM_BIND and related uapi definitions > > v2: Ensure proper kernel-doc formatting with cross references. > Also add new uapi and documentation as per review comments > from Daniel. > > Signed-off-by: Niranjana Vishwanathapura
niranjana.vishwanathapura@intel.com
> --- > Documentation/gpu/rfc/i915_vm_bind.h | 399
+++++++++++++++++++++++++++
> 1 file changed, 399 insertions(+) > create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h > > diff --git a/Documentation/gpu/rfc/i915_vm_bind.h
b/Documentation/gpu/rfc/i915_vm_bind.h
> new file mode 100644 > index 000000000000..589c0a009107 > --- /dev/null > +++ b/Documentation/gpu/rfc/i915_vm_bind.h > @@ -0,0 +1,399 @@ > +/* SPDX-License-Identifier: MIT */ > +/* > + * Copyright © 2022 Intel Corporation > + */ > + > +/** > + * DOC: I915_PARAM_HAS_VM_BIND > + * > + * VM_BIND feature availability. > + * See typedef drm_i915_getparam_t param. > + */ > +#define I915_PARAM_HAS_VM_BIND 57 > + > +/** > + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND > + * > + * Flag to opt-in for VM_BIND mode of binding during VM creation. > + * See struct drm_i915_gem_vm_control flags. > + * > + * A VM in VM_BIND mode will not support the older execbuff
mode of binding.
> + * In VM_BIND mode, execbuff ioctl will not accept any execlist
(ie., the
> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). > + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and > + * &drm_i915_gem_execbuffer2.batch_len must be 0. > + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must
be provided
> + * to pass in the batch buffer addresses. > + * > + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and > + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags
must be 0
> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag
must always be
> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses). > + * The buffers_ptr, buffer_count, batch_start_offset and
batch_len fields
> + * of struct drm_i915_gem_execbuffer2 are also not used and
must be 0.
> + */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 {
        __u64 buffers_ptr;          -> must be 0 (new)
        __u32 buffer_count;         -> must be 0 (new)
        __u32 batch_start_offset;   -> must be 0 (new)
        __u32 batch_len;            -> must be 0 (new)
        __u32 DR1;                  -> must be 0 (old)
        __u32 DR4;                  -> must be 0 (old)
        __u32 num_cliprects; (fences)              -> must be 0 since using extensions
        __u64 cliprects_ptr; (fences, extensions)  -> contains an actual pointer!
        __u64 flags;                -> some flags must be 0 (new)
        __u64 rsvd1; (context info) -> repurposed field (old)
        __u64 rsvd2;                -> unused
};
Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how
the VM
was created (which is an entirely different ioctl).
From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.
The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
struct drm_i915_gem_execbuffer3 { __u32 ctx_id; /* previously execbuffer2.rsvd1 */
__u32 batch_count; __u64 batch_addr_ptr; /* Pointer to an array of batch gpu virtual addresses */
Casual stumble upon..
Alternatively you could embed N pointers to make life a bit easier for both userspace and kernel side. Yes, but then "N batch buffers should be enough for everyone" problem.. :)
__u64 flags; #define I915_EXEC3_RING_MASK (0x3f) #define I915_EXEC3_DEFAULT (0<<0) #define I915_EXEC3_RENDER (1<<0) #define I915_EXEC3_BSD (2<<0) #define I915_EXEC3_BLT (3<<0) #define I915_EXEC3_VEBOX (4<<0)
#define I915_EXEC3_SECURE (1<<6) #define I915_EXEC3_IS_PINNED (1<<7)
#define I915_EXEC3_BSD_SHIFT (8) #define I915_EXEC3_BSD_MASK (3 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_DEFAULT (0 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING1 (1 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING2 (2 << I915_EXEC3_BSD_SHIFT)
I'd suggest legacy engine selection is unwanted, especially not with the convoluted BSD1/2 flags. Can we just require context with engine map and index? Or if default context has to be supported then I'd suggest ...class_instance for that mode.
#define I915_EXEC3_FENCE_IN (1<<10) #define I915_EXEC3_FENCE_OUT (1<<11) #define I915_EXEC3_FENCE_SUBMIT (1<<12)
People are likely to object to the submit fence since the generic mechanism to align submissions was rejected.
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
New ioctl you can afford dedicated fields.
In any case I suggest you involve UMD folks in designing it.
Regards,
Tvrtko
__u64 extensions; /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */ };
With this, user can pass in batch addresses and count directly, instead of as an extension (as this rfc series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Anything else that needs to be added or removed?
Niranjana
Niranjana
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Tue, Jun 07, 2022 at 11:42:08AM +0100, Tvrtko Ursulin wrote:
On 03/06/2022 07:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote: >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote: >> VM_BIND and related uapi definitions >> >> v2: Ensure proper kernel-doc formatting with cross references. >> Also add new uapi and documentation as per review comments >> from Daniel. >> >> Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com >> --- >> Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ >> 1 file changed, 399 insertions(+) >> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h >> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h >> new file mode 100644 >> index 000000000000..589c0a009107 >> --- /dev/null >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h >> @@ -0,0 +1,399 @@ >> +/* SPDX-License-Identifier: MIT */ >> +/* >> + * Copyright © 2022 Intel Corporation >> + */ >> + >> +/** >> + * DOC: I915_PARAM_HAS_VM_BIND >> + * >> + * VM_BIND feature availability. >> + * See typedef drm_i915_getparam_t param. >> + */ >> +#define I915_PARAM_HAS_VM_BIND 57 >> + >> +/** >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND >> + * >> + * Flag to opt-in for VM_BIND mode of binding during VM creation. >> + * See struct drm_i915_gem_vm_control flags. >> + * >> + * A VM in VM_BIND mode will not support the older execbuff mode of binding. >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and >> + * &drm_i915_gem_execbuffer2.batch_len must be 0. >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided >> + * to pass in the batch buffer addresses. >> + * >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0 >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses). >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields >> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0. >> + */ > >From that description, it seems we have: > >struct drm_i915_gem_execbuffer2 { > __u64 buffers_ptr; -> must be 0 (new) > __u32 buffer_count; -> must be 0 (new) > __u32 batch_start_offset; -> must be 0 (new) > __u32 batch_len; -> must be 0 (new) > __u32 DR1; -> must be 0 (old) > __u32 DR4; -> must be 0 (old) > __u32 num_cliprects; (fences) -> must be 0 since using extensions > __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer! > __u64 flags; -> some flags must be 0 (new) > __u64 rsvd1; (context info) -> repurposed field (old) > __u64 rsvd2; -> unused >}; > >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead >of adding even more complexity to an already abused interface? While >the Vulkan-like extension thing is really nice, I don't think what >we're doing here is extending the ioctl usage, we're completely >changing how the base struct should be interpreted based on how the VM >was created (which is an entirely different ioctl). > >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is >already at -6 without these changes. I think after vm_bind we'll need >to create a -11 entry just to deal with this ioctl. >
The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
struct drm_i915_gem_execbuffer3 { __u32 ctx_id; /* previously execbuffer2.rsvd1 */
__u32 batch_count; __u64 batch_addr_ptr; /* Pointer to an array of batch gpu virtual addresses */
Casual stumble upon..
Alternatively you could embed N pointers to make life a bit easier for both userspace and kernel side. Yes, but then "N batch buffers should be enough for everyone" problem.. :)
Thanks Tvrtko, Yes, hence the batch_addr_ptr.
__u64 flags; #define I915_EXEC3_RING_MASK (0x3f) #define I915_EXEC3_DEFAULT (0<<0) #define I915_EXEC3_RENDER (1<<0) #define I915_EXEC3_BSD (2<<0) #define I915_EXEC3_BLT (3<<0) #define I915_EXEC3_VEBOX (4<<0)
#define I915_EXEC3_SECURE (1<<6) #define I915_EXEC3_IS_PINNED (1<<7)
#define I915_EXEC3_BSD_SHIFT (8) #define I915_EXEC3_BSD_MASK (3 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_DEFAULT (0 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING1 (1 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING2 (2 << I915_EXEC3_BSD_SHIFT)
I'd suggest legacy engine selection is unwanted, especially not with the convoluted BSD1/2 flags. Can we just require context with engine map and index? Or if default context has to be supported then I'd suggest ...class_instance for that mode.
Ok, I will be happy to remove it and only support contexts with engine map, if UMDs agree on that.
#define I915_EXEC3_FENCE_IN (1<<10) #define I915_EXEC3_FENCE_OUT (1<<11) #define I915_EXEC3_FENCE_SUBMIT (1<<12)
People are likely to object to submit fence since generic mechanism to align submissions was rejected.
Ok, again, I can remove it if UMDs are ok with it.
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
New ioctl you can afford dedicated fields.
Yes, but as I asked below, I am not sure if we need this or the timeline fence arry extension we have is good enough.
In any case I suggest you involve UMD folks in designing it.
Yah. Paulo, Lionel, Jason, Daniel, can you comment on these regarding what will UMD need in execbuf3 and what can be removed?
Thanks, Niranjana
Regards,
Tvrtko
__u64 extensions; /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */ };
With this, user can pass in batch addresses and count directly, instead of as an extension (as this rfc series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Anything else that needs to be added or removed?
Niranjana
Niranjana
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On 07/06/2022 22:25, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:42:08AM +0100, Tvrtko Ursulin wrote:
On 03/06/2022 07:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote:

VM_BIND and related uapi definitions

v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.

Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
---
 Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
 1 file changed, 399 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h

diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
new file mode 100644
index 000000000000..589c0a009107
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.h
@@ -0,0 +1,399 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+/**
+ * DOC: I915_PARAM_HAS_VM_BIND
+ *
+ * VM_BIND feature availability.
+ * See typedef drm_i915_getparam_t param.
+ */
+#define I915_PARAM_HAS_VM_BIND		57
+
+/**
+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
+ *
+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
+ * See struct drm_i915_gem_vm_control flags.
+ *
+ * A VM in VM_BIND mode will not support the older execbuff mode of binding.
+ * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
+ * &drm_i915_gem_execbuffer2.buffer_count must be 0).
+ * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
+ * &drm_i915_gem_execbuffer2.batch_len must be 0.
+ * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
+ * to pass in the batch buffer addresses.
+ *
+ * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
+ * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
+ * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
+ * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+ * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
+ * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
+ */

From that description, it seems we have:

struct drm_i915_gem_execbuffer2 {
        __u64 buffers_ptr;              -> must be 0 (new)
        __u32 buffer_count;             -> must be 0 (new)
        __u32 batch_start_offset;       -> must be 0 (new)
        __u32 batch_len;                -> must be 0 (new)
        __u32 DR1;                      -> must be 0 (old)
        __u32 DR4;                      -> must be 0 (old)
        __u32 num_cliprects; (fences)   -> must be 0 since using extensions
        __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer!
        __u64 flags;                    -> some flags must be 0 (new)
        __u64 rsvd1; (context info)     -> repurposed field (old)
        __u64 rsvd2;                    -> unused
};

Based on that, why can't we just get drm_i915_gem_execbuffer3 instead of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how the VM was created (which is an entirely different ioctl).

From Rusty Russell's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll need to create a -11 entry just to deal with this ioctl.

The only change here is removing the execlist support for VM_BIND mode (other than natural extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
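For reference, a minimal userspace sketch of the VM_BIND opt-in the quoted kernel-doc describes. I915_VM_CREATE_FLAGS_USE_VM_BIND and its bit value are assumptions from this RFC only; the rest is the existing drm_i915_gem_vm_control uapi:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    #ifndef I915_VM_CREATE_FLAGS_USE_VM_BIND
    #define I915_VM_CREATE_FLAGS_USE_VM_BIND (1u << 0)  /* RFC-only, assumed value */
    #endif

    /* Create an address space that opts in to VM_BIND mode of binding. */
    static int create_vm_bind_vm(int drm_fd, uint32_t *vm_id)
    {
            struct drm_i915_gem_vm_control ctl;

            memset(&ctl, 0, sizeof(ctl));
            ctl.flags = I915_VM_CREATE_FLAGS_USE_VM_BIND;

            if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_VM_CREATE, &ctl))
                    return -1;

            *vm_id = ctl.vm_id;  /* attach to a context via I915_CONTEXT_PARAM_VM */
            return 0;
    }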
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softpin paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and have shared code where possible. Besides, we can stop supporting some older features in execbuf3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;           /* previously execbuffer2.rsvd1 */

        __u32 batch_count;
        __u64 batch_addr_ptr;   /* Pointer to an array of batch gpu virtual addresses */
Casual stumble upon..
Alternatively you could embed N pointers to make life a bit easier for both userspace and kernel side. Yes, but then "N batch buffers should be enough for everyone" problem.. :)
Thanks Tvrtko, Yes, hence the batch_addr_ptr.
Right, but then userspace has to allocate a separate buffer and kernel has to access it separately from a single copy_from_user. Pros and cons of "this many batches should be enough for everyone" versus the extra operations.
Hmm.. for the common case of one batch - you could define the uapi to say if batch_count is one then pointer is GPU VA to the batch itself, not a pointer to userspace array of GPU VA?
Regards,
Tvrtko
        __u64 flags;
#define I915_EXEC3_RING_MASK            (0x3f)
#define I915_EXEC3_DEFAULT              (0<<0)
#define I915_EXEC3_RENDER               (1<<0)
#define I915_EXEC3_BSD                  (2<<0)
#define I915_EXEC3_BLT                  (3<<0)
#define I915_EXEC3_VEBOX                (4<<0)

#define I915_EXEC3_SECURE               (1<<6)
#define I915_EXEC3_IS_PINNED            (1<<7)

#define I915_EXEC3_BSD_SHIFT            (8)
#define I915_EXEC3_BSD_MASK             (3 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_DEFAULT          (0 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING1            (1 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING2            (2 << I915_EXEC3_BSD_SHIFT)
I'd suggest legacy engine selection is unwanted, especially not with the convoluted BSD1/2 flags. Can we just require context with engine map and index? Or if default context has to be supported then I'd suggest ...class_instance for that mode.
Ok, I will be happy to remove it and only support contexts with engine map, if UMDs agree on that.
#define I915_EXEC3_FENCE_IN             (1<<10)
#define I915_EXEC3_FENCE_OUT            (1<<11)
#define I915_EXEC3_FENCE_SUBMIT         (1<<12)
People are likely to object to submit fence since generic mechanism to align submissions was rejected.
Ok, again, I can remove it if UMDs are ok with it.
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
New ioctl you can afford dedicated fields.
Yes, but as I asked below, I am not sure if we need this or the timeline fence array extension we have is good enough.
In any case I suggest you involve UMD folks in designing it.
Yah. Paulo, Lionel, Jason, Daniel, can you comment on these regarding what will UMD need in execbuf3 and what can be removed?
Thanks, Niranjana
Regards,
Tvrtko
        __u64 extensions;       /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
};
With this, user can pass in batch addresses and count directly, instead of as an extension (as this rfc series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Anything else that needs to be added or removed?
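For readers following along, a provisional consolidation of what execbuf3 could look like if the suggestions in this thread are taken (engine selected by index into the context engine map, batch VA passed inline for the single-batch case, fences only via the timeline fences extension). Every name and field below is a sketch of the discussion, not a settled uapi:

    /* Provisional sketch only; nothing here is final uapi. */
    struct drm_i915_gem_execbuffer3 {
            __u32 ctx_id;           /* context with an engine map, was execbuffer2.rsvd1 */
            __u32 engine_idx;       /* index into the context engine map (replaces ring flags) */

            __u32 batch_count;
            __u32 pad;
            __u64 batch_addr_ptr;   /* batch GPU VA if batch_count == 1, else pointer to a
                                     * userspace array of batch GPU VAs */

            __u64 flags;            /* legacy ring/BSD selection and FENCE_IN/OUT bits dropped */

            __u64 extensions;       /* e.g. DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
    };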
Niranjana
On Wed, Jun 08, 2022 at 08:34:36AM +0100, Tvrtko Ursulin wrote:
Hmm.. for the common case of one batch - you could define the uapi to say if batch_count is one then pointer is GPU VA to the batch itself, not a pointer to userspace array of GPU VA?
Yah, we can do that. ie., batch_addr_ptr is the batch VA when batch_count is 1. Otherwise, it is pointer to an array of batch VAs.
Other option is to move multi-batch support to an extension and here we will only have batch_addr (ie., support for 1 batch only).
I like the former one better (the one you suggested).
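A tiny userspace-side illustration of that convention; eb3, batch_va and n_batches are assumed locals, and the field names come from the provisional proposal earlier in the thread:

    if (n_batches == 1) {
            eb3.batch_count = 1;
            eb3.batch_addr_ptr = batch_va[0];            /* the batch GPU VA itself */
    } else {
            eb3.batch_count = n_batches;
            eb3.batch_addr_ptr = (uintptr_t)batch_va;    /* pointer to an array of GPU VAs */
    }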
Niranjana
On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
Does this sound reasonable?
Thanks for proposing this. Some comments below.
struct drm_i915_gem_execbuffer3 { __u32 ctx_id; /* previously execbuffer2.rsvd1 */
__u32 batch_count; __u64 batch_addr_ptr; /* Pointer to an array of batch gpu virtual addresses */
__u64 flags; #define I915_EXEC3_RING_MASK (0x3f) #define I915_EXEC3_DEFAULT (0<<0) #define I915_EXEC3_RENDER (1<<0) #define I915_EXEC3_BSD (2<<0) #define I915_EXEC3_BLT (3<<0) #define I915_EXEC3_VEBOX (4<<0)
Shouldn't we use the new engine selection uAPI instead?
We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in drm_i915_gem_context_create_ext_setparam.
And you can also create virtual engines with the same extension.
It feels like this could be a single u32 with the engine index (in the context engine map).
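A minimal sketch of that flow with the existing uapi: create a context with an explicit engine map, then refer to engines only by index. The 'engine_idx' field mentioned at the end is hypothetical (part of this RFC discussion); everything else is current i915_drm.h:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    static uint32_t create_ctx_with_engine_map(int drm_fd)
    {
            /* Engine map: index 0 = render, index 1 = copy engine. */
            struct {
                    __u64 extensions;
                    struct i915_engine_class_instance engines[2];
            } __attribute__((packed)) engine_map = {
                    .engines = {
                            { .engine_class = I915_ENGINE_CLASS_RENDER, .engine_instance = 0 },
                            { .engine_class = I915_ENGINE_CLASS_COPY,   .engine_instance = 0 },
                    },
            };

            struct drm_i915_gem_context_create_ext_setparam p_engines = {
                    .base  = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
                    .param = {
                            .param = I915_CONTEXT_PARAM_ENGINES,
                            .size  = sizeof(engine_map),
                            .value = (uintptr_t)&engine_map,
                    },
            };

            struct drm_i915_gem_context_create_ext create = {
                    .flags      = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
                    .extensions = (uintptr_t)&p_engines,
            };

            ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
            return create.ctx_id;
            /* An execbuf3 would then carry ctx_id plus e.g. engine_idx = 1
             * (hypothetical field) to submit to the copy engine. */
    }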
#define I915_EXEC3_SECURE (1<<6) #define I915_EXEC3_IS_PINNED (1<<7)
What's the meaning of PINNED?
#define I915_EXEC3_BSD_SHIFT (8) #define I915_EXEC3_BSD_MASK (3 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_DEFAULT (0 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING1 (1 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING2 (2 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_FENCE_IN (1<<10) #define I915_EXEC3_FENCE_OUT (1<<11)
For Mesa, as soon as we have DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES support, we only use that.
So there isn't much point for FENCE_IN/OUT.
Maybe check with other UMDs?
#define I915_EXEC3_FENCE_SUBMIT (1<<12)
What's FENCE_SUBMIT?
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
__u64 extensions; /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */ };
On 08/06/2022 09:40, Lionel Landwerlin wrote:
For Mesa, as soon as we have DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES support, we only use that.
So there isn't much point for FENCE_IN/OUT.
Maybe check with other UMDs?
Correcting myself a bit here :
- iris uses I915_EXEC_FENCE_ARRAY
- anv uses I915_EXEC_FENCE_ARRAY or DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES
In either case we could easily switch to DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES all the time.
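To make the "timeline fences only" option concrete, a sketch of attaching the existing extension. The structs below are current uapi; hanging the chain off an execbuf3 'extensions' field is the RFC part, and the syncobj handles/points are assumed to exist:

    struct drm_i915_gem_exec_fence fences[2] = {
            { .handle = wait_syncobj,   .flags = I915_EXEC_FENCE_WAIT },
            { .handle = signal_syncobj, .flags = I915_EXEC_FENCE_SIGNAL },
    };
    __u64 points[2] = { wait_point, signal_point };  /* 0 for a binary syncobj */

    struct drm_i915_gem_execbuffer_ext_timeline_fences ext = {
            .base.name   = DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES,
            .fence_count = 2,
            .handles_ptr = (uintptr_t)fences,
            .values_ptr  = (uintptr_t)points,
    };

    eb3.extensions = (uintptr_t)&ext;  /* eb3 = hypothetical struct drm_i915_gem_execbuffer3 */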
On 08/06/2022 07:40, Lionel Landwerlin wrote:
Shouldn't we use the new engine selection uAPI instead?
We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in drm_i915_gem_context_create_ext_setparam.
And you can also create virtual engines with the same extension.
It feels like this could be a single u32 with the engine index (in the context engine map).
Yes I said the same yesterday.
Also note that as you can't any longer set engines on a default context, question is whether userspace cares to use execbuf3 with it (default context).
If it does, it will need an alternative engine selection for that case. I was proposing class:instance rather than legacy cumbersome flags.
If it does not, I mean if the decision is to only allow execbuf3 with engine maps, then it leaves the default context a waste of kernel memory in the execbuf3 future. :( Don't know what to do there..
Regards,
Tvrtko
On 08/06/2022 11:36, Tvrtko Ursulin wrote:
On 08/06/2022 07:40, Lionel Landwerlin wrote:
On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote: > > On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote: > >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura > wrote: > >> VM_BIND and related uapi definitions > >> > >> v2: Ensure proper kernel-doc formatting with cross references. > >> Also add new uapi and documentation as per review comments > >> from Daniel. > >> > >> Signed-off-by: Niranjana Vishwanathapura > niranjana.vishwanathapura@intel.com > >> --- > >> Documentation/gpu/rfc/i915_vm_bind.h | 399 > +++++++++++++++++++++++++++ > >> 1 file changed, 399 insertions(+) > >> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h > >> > >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h > b/Documentation/gpu/rfc/i915_vm_bind.h > >> new file mode 100644 > >> index 000000000000..589c0a009107 > >> --- /dev/null > >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h > >> @@ -0,0 +1,399 @@ > >> +/* SPDX-License-Identifier: MIT */ > >> +/* > >> + * Copyright © 2022 Intel Corporation > >> + */ > >> + > >> +/** > >> + * DOC: I915_PARAM_HAS_VM_BIND > >> + * > >> + * VM_BIND feature availability. > >> + * See typedef drm_i915_getparam_t param. > >> + */ > >> +#define I915_PARAM_HAS_VM_BIND 57 > >> + > >> +/** > >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND > >> + * > >> + * Flag to opt-in for VM_BIND mode of binding during VM > creation. > >> + * See struct drm_i915_gem_vm_control flags. > >> + * > >> + * A VM in VM_BIND mode will not support the older execbuff > mode of binding. > >> + * In VM_BIND mode, execbuff ioctl will not accept any > execlist (ie., the > >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). > >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and > >> + * &drm_i915_gem_execbuffer2.batch_len must be 0. > >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension > must be provided > >> + * to pass in the batch buffer addresses. > >> + * > >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and > >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags > must be 0 > >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag > must always be > >> + * set (See struct > drm_i915_gem_execbuffer_ext_batch_addresses). > >> + * The buffers_ptr, buffer_count, batch_start_offset and > batch_len fields > >> + * of struct drm_i915_gem_execbuffer2 are also not used and > must be 0. > >> + */ > > > >From that description, it seems we have: > > > >struct drm_i915_gem_execbuffer2 { > > __u64 buffers_ptr; -> must be 0 (new) > > __u32 buffer_count; -> must be 0 (new) > > __u32 batch_start_offset; -> must be 0 (new) > > __u32 batch_len; -> must be 0 (new) > > __u32 DR1; -> must be 0 (old) > > __u32 DR4; -> must be 0 (old) > > __u32 num_cliprects; (fences) -> must be 0 since > using extensions > > __u64 cliprects_ptr; (fences, extensions) -> contains > an actual pointer! > > __u64 flags; -> some flags must be 0 > (new) > > __u64 rsvd1; (context info) -> repurposed field (old) > > __u64 rsvd2; -> unused > >}; > > > >Based on that, why can't we just get drm_i915_gem_execbuffer3 > instead > >of adding even more complexity to an already abused interface? > While > >the Vulkan-like extension thing is really nice, I don't think what > >we're doing here is extending the ioctl usage, we're completely > >changing how the base struct should be interpreted based on how > the VM > >was created (which is an entirely different ioctl). 
> > > >From Rusty Russel's API Design grading, > drm_i915_gem_execbuffer2 is > >already at -6 without these changes. I think after vm_bind > we'll need > >to create a -11 entry just to deal with this ioctl. > > > > The only change here is removing the execlist support for VM_BIND > mode (other than natual extensions). > Adding a new execbuffer3 was considered, but I think we need to > be careful > with that as that goes beyond the VM_BIND support, including any > future > requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
Thanks for proposing this. Some comments below.
struct drm_i915_gem_execbuffer3 { __u32 ctx_id; /* previously execbuffer2.rsvd1 */
__u32 batch_count; __u64 batch_addr_ptr; /* Pointer to an array of batch gpu virtual addresses */
__u64 flags; #define I915_EXEC3_RING_MASK (0x3f) #define I915_EXEC3_DEFAULT (0<<0) #define I915_EXEC3_RENDER (1<<0) #define I915_EXEC3_BSD (2<<0) #define I915_EXEC3_BLT (3<<0) #define I915_EXEC3_VEBOX (4<<0)
Shouldn't we use the new engine selection uAPI instead?
We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in drm_i915_gem_context_create_ext_setparam.
And you can also create virtual engines with the same extension.
It feels like this could be a single u32 with the engine index (in the context engine map).
Yes I said the same yesterday.
Also note that as you can't any longer set engines on a default context, question is whether userspace cares to use execbuf3 with it (default context).
If it does, it will need an alternative engine selection for that case. I was proposing class:instance rather than legacy cumbersome flags.
If it does not, I mean if the decision is to only allow execbuf3 with engine maps, then it leaves the default context a waste of kernel memory in the execbuf3 future. :( Don't know what to do there..
Regards,
Tvrtko
Thanks Tvrtko, I only saw your reply after responding.
Both Iris & Anv create a context with engines (if kernel supports it) : https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/common/intel_...
I think we should be fine with just a single engine id and we don't care about the default context.
-Lionel
On 08/06/2022 09:45, Lionel Landwerlin wrote:
On 08/06/2022 11:36, Tvrtko Ursulin wrote:
On 08/06/2022 07:40, Lionel Landwerlin wrote:
On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote: > > On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura > niranjana.vishwanathapura@intel.com wrote: >> >> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote: >> >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura >> wrote: >> >> VM_BIND and related uapi definitions >> >> >> >> v2: Ensure proper kernel-doc formatting with cross references. >> >> Also add new uapi and documentation as per review comments >> >> from Daniel. >> >> >> >> Signed-off-by: Niranjana Vishwanathapura >> niranjana.vishwanathapura@intel.com >> >> --- >> >> Documentation/gpu/rfc/i915_vm_bind.h | 399 >> +++++++++++++++++++++++++++ >> >> 1 file changed, 399 insertions(+) >> >> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h >> >> >> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h >> b/Documentation/gpu/rfc/i915_vm_bind.h >> >> new file mode 100644 >> >> index 000000000000..589c0a009107 >> >> --- /dev/null >> >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h >> >> @@ -0,0 +1,399 @@ >> >> +/* SPDX-License-Identifier: MIT */ >> >> +/* >> >> + * Copyright © 2022 Intel Corporation >> >> + */ >> >> + >> >> +/** >> >> + * DOC: I915_PARAM_HAS_VM_BIND >> >> + * >> >> + * VM_BIND feature availability. >> >> + * See typedef drm_i915_getparam_t param. >> >> + */ >> >> +#define I915_PARAM_HAS_VM_BIND 57 >> >> + >> >> +/** >> >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND >> >> + * >> >> + * Flag to opt-in for VM_BIND mode of binding during VM >> creation. >> >> + * See struct drm_i915_gem_vm_control flags. >> >> + * >> >> + * A VM in VM_BIND mode will not support the older execbuff >> mode of binding. >> >> + * In VM_BIND mode, execbuff ioctl will not accept any >> execlist (ie., the >> >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). >> >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and >> >> + * &drm_i915_gem_execbuffer2.batch_len must be 0. >> >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension >> must be provided >> >> + * to pass in the batch buffer addresses. >> >> + * >> >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and >> >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags >> must be 0 >> >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag >> must always be >> >> + * set (See struct >> drm_i915_gem_execbuffer_ext_batch_addresses). >> >> + * The buffers_ptr, buffer_count, batch_start_offset and >> batch_len fields >> >> + * of struct drm_i915_gem_execbuffer2 are also not used and >> must be 0. >> >> + */ >> > >> >From that description, it seems we have: >> > >> >struct drm_i915_gem_execbuffer2 { >> > __u64 buffers_ptr; -> must be 0 (new) >> > __u32 buffer_count; -> must be 0 (new) >> > __u32 batch_start_offset; -> must be 0 (new) >> > __u32 batch_len; -> must be 0 (new) >> > __u32 DR1; -> must be 0 (old) >> > __u32 DR4; -> must be 0 (old) >> > __u32 num_cliprects; (fences) -> must be 0 since >> using extensions >> > __u64 cliprects_ptr; (fences, extensions) -> contains >> an actual pointer! >> > __u64 flags; -> some flags must be 0 >> (new) >> > __u64 rsvd1; (context info) -> repurposed field (old) >> > __u64 rsvd2; -> unused >> >}; >> > >> >Based on that, why can't we just get drm_i915_gem_execbuffer3 >> instead >> >of adding even more complexity to an already abused interface? 
>> While >> >the Vulkan-like extension thing is really nice, I don't think what >> >we're doing here is extending the ioctl usage, we're completely >> >changing how the base struct should be interpreted based on how >> the VM >> >was created (which is an entirely different ioctl). >> > >> >From Rusty Russel's API Design grading, >> drm_i915_gem_execbuffer2 is >> >already at -6 without these changes. I think after vm_bind >> we'll need >> >to create a -11 entry just to deal with this ioctl. >> > >> >> The only change here is removing the execlist support for VM_BIND >> mode (other than natual extensions). >> Adding a new execbuffer3 was considered, but I think we need to >> be careful >> with that as that goes beyond the VM_BIND support, including any >> future >> requirements (as we don't want an execbuffer4 after VM_BIND). > > Why not? it's not like adding extensions here is really that > different > than adding new ioctls. > > I definitely think this deserves an execbuffer3 without even > considering future requirements. Just to burn down the old > requirements and pointless fields. > > Make execbuffer3 be vm bind only, no relocs, no legacy bits, > leave the > older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
Thanks for proposing this. Some comments below.
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;          /* previously execbuffer2.rsvd1 */

        __u32 batch_count;
        __u64 batch_addr_ptr;  /* Pointer to an array of batch gpu virtual addresses */

        __u64 flags;
#define I915_EXEC3_RING_MASK   (0x3f)
#define I915_EXEC3_DEFAULT     (0<<0)
#define I915_EXEC3_RENDER      (1<<0)
#define I915_EXEC3_BSD         (2<<0)
#define I915_EXEC3_BLT         (3<<0)
#define I915_EXEC3_VEBOX       (4<<0)
Shouldn't we use the new engine selection uAPI instead?
We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in drm_i915_gem_context_create_ext_setparam.
And you can also create virtual engines with the same extension.
It feels like this could be a single u32 with the engine index (in the context engine map).
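For reference, a minimal sketch (not part of the proposal itself) of creating a context with a user engine map through the existing I915_CONTEXT_PARAM_ENGINES uapi; the single render engine and missing error handling are just to keep it short:

        #include <stdint.h>
        #include <sys/ioctl.h>
        #include <drm/i915_drm.h>

        /* Sketch: create a context whose engine map has one render engine, so an
         * execbuf3-style submission could select it as engine index 0. */
        static uint32_t create_ctx_with_engine_map(int drm_fd)
        {
                I915_DEFINE_CONTEXT_PARAM_ENGINES(engines, 1) = {
                        .engines = {
                                { .engine_class = I915_ENGINE_CLASS_RENDER,
                                  .engine_instance = 0 },
                        },
                };
                struct drm_i915_gem_context_create_ext_setparam set_engines = {
                        .base = { .name = I915_CONTEXT_CREATE_EXT_SETPARAM },
                        .param = {
                                .param = I915_CONTEXT_PARAM_ENGINES,
                                .value = (uintptr_t)&engines,
                                .size = sizeof(engines),
                        },
                };
                struct drm_i915_gem_context_create_ext create = {
                        .flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
                        .extensions = (uintptr_t)&set_engines,
                };

                ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, &create);
                return create.ctx_id;
        }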
Yes I said the same yesterday.
Also note that since you can no longer set engines on a default context, the question is whether userspace cares to use execbuf3 with it (the default context).
If it does, it will need an alternative engine selection for that case. I was proposing class:instance rather than legacy cumbersome flags.
If it does not, I mean if the decision is to only allow execbuf3 with engine maps, then it leaves the default context a waste of kernel memory in the execbuf3 future. :( Don't know what to do there..
Regards,
Tvrtko
Thanks Tvrtko, I only saw your reply after responding.
Both Iris & Anv create a context with engines (if kernel supports it) : https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/common/intel_...
I think we should be fine with just a single engine id and we don't care about the default context.
I wonder if in this case we could stop creating the default context starting from a future "gen"? Otherwise, with engine-map-only execbuf3 and execbuf3-only userspace, it would serve no purpose apart from wasting kernel memory.
Regards,
Tvrtko
-Lionel
#define I915_EXEC3_SECURE      (1<<6)
#define I915_EXEC3_IS_PINNED   (1<<7)
What's the meaning of PINNED?
#define I915_EXEC3_BSD_SHIFT   (8)
#define I915_EXEC3_BSD_MASK    (3 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_DEFAULT (0 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING1   (1 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING2   (2 << I915_EXEC3_BSD_SHIFT)

#define I915_EXEC3_FENCE_IN    (1<<10)
#define I915_EXEC3_FENCE_OUT   (1<<11)
For Mesa, as soon as we have DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES support, we only use that.
So there isn't much point for FENCE_IN/OUT.
Maybe check with other UMDs?
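For reference, this is roughly how the existing timeline-fences extension is populated (a sketch only; the syncobj handles and points below are placeholders):

        /* Sketch: one timeline syncobj to wait on and one to signal, attached to a
         * submission through the existing timeline-fences extension. */
        __u32 wait_syncobj = 0, signal_syncobj = 0;     /* placeholder handles */
        __u64 wait_point = 1, signal_point = 2;         /* placeholder points */

        struct drm_i915_gem_exec_fence handles[2] = {
                { .handle = wait_syncobj,   .flags = I915_EXEC_FENCE_WAIT },
                { .handle = signal_syncobj, .flags = I915_EXEC_FENCE_SIGNAL },
        };
        __u64 points[2] = { wait_point, signal_point };

        struct drm_i915_gem_execbuffer_ext_timeline_fences fences = {
                .base = { .name = DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES },
                .fence_count = 2,
                .handles_ptr = (uintptr_t)handles,
                .values_ptr  = (uintptr_t)points,
        };
        /* 'fences' is then chained into the submission's extensions list. */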
#define I915_EXEC3_FENCE_SUBMIT (1<<12)
What's FENCE_SUBMIT?
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
        __u64 extensions;      /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
};
With this, the user can pass in the batch addresses and count directly, instead of as an extension (as this RFC series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Anything else that needs to be added or removed?
Niranjana
Niranjana
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Jun 08, 2022 at 09:54:24AM +0100, Tvrtko Ursulin wrote:
On 08/06/2022 09:45, Lionel Landwerlin wrote:
On 08/06/2022 11:36, Tvrtko Ursulin wrote:
On 08/06/2022 07:40, Lionel Landwerlin wrote:
On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote: >On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote: >> >>On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura >>niranjana.vishwanathapura@intel.com wrote: >>> >>>On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote: >>>>On Tue, 2022-05-17 at 11:32 -0700, Niranjana >>>Vishwanathapura wrote: >>>>> VM_BIND and related uapi definitions >>>>> >>>>> v2: Ensure proper kernel-doc formatting with cross references. >>>>> Also add new uapi and documentation as per review comments >>>>> from Daniel. >>>>> >>>>> Signed-off-by: Niranjana Vishwanathapura >>>niranjana.vishwanathapura@intel.com >>>>> --- >>>>> Documentation/gpu/rfc/i915_vm_bind.h | 399 >>>+++++++++++++++++++++++++++ >>>>> 1 file changed, 399 insertions(+) >>>>> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h >>>>> >>>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h >>>b/Documentation/gpu/rfc/i915_vm_bind.h >>>>> new file mode 100644 >>>>> index 000000000000..589c0a009107 >>>>> --- /dev/null >>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h >>>>> @@ -0,0 +1,399 @@ >>>>> +/* SPDX-License-Identifier: MIT */ >>>>> +/* >>>>> + * Copyright © 2022 Intel Corporation >>>>> + */ >>>>> + >>>>> +/** >>>>> + * DOC: I915_PARAM_HAS_VM_BIND >>>>> + * >>>>> + * VM_BIND feature availability. >>>>> + * See typedef drm_i915_getparam_t param. >>>>> + */ >>>>> +#define I915_PARAM_HAS_VM_BIND 57 >>>>> + >>>>> +/** >>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND >>>>> + * >>>>> + * Flag to opt-in for VM_BIND mode of binding >>>during VM creation. >>>>> + * See struct drm_i915_gem_vm_control flags. >>>>> + * >>>>> + * A VM in VM_BIND mode will not support the older >>>execbuff mode of binding. >>>>> + * In VM_BIND mode, execbuff ioctl will not accept >>>any execlist (ie., the >>>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). >>>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and >>>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0. >>>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES >>>extension must be provided >>>>> + * to pass in the batch buffer addresses. >>>>> + * >>>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and >>>>> + * I915_EXEC_BATCH_FIRST of >>>&drm_i915_gem_execbuffer2.flags must be 0 >>>>> + * (not used) in VM_BIND mode. >>>I915_EXEC_USE_EXTENSIONS flag must always be >>>>> + * set (See struct >>>drm_i915_gem_execbuffer_ext_batch_addresses). >>>>> + * The buffers_ptr, buffer_count, >>>batch_start_offset and batch_len fields >>>>> + * of struct drm_i915_gem_execbuffer2 are also not >>>used and must be 0. >>>>> + */ >>>> >>>>From that description, it seems we have: >>>> >>>>struct drm_i915_gem_execbuffer2 { >>>> __u64 buffers_ptr; -> must be 0 (new) >>>> __u32 buffer_count; -> must be 0 (new) >>>> __u32 batch_start_offset; -> must be 0 (new) >>>> __u32 batch_len; -> must be 0 (new) >>>> __u32 DR1; -> must be 0 (old) >>>> __u32 DR4; -> must be 0 (old) >>>> __u32 num_cliprects; (fences) -> must be 0 >>>since using extensions >>>> __u64 cliprects_ptr; (fences, extensions) -> >>>contains an actual pointer! >>>> __u64 flags; -> some flags >>>must be 0 (new) >>>> __u64 rsvd1; (context info) -> repurposed field (old) >>>> __u64 rsvd2; -> unused >>>>}; >>>> >>>>Based on that, why can't we just get >>>drm_i915_gem_execbuffer3 instead >>>>of adding even more complexity to an already abused >>>interface? 
While >>>>the Vulkan-like extension thing is really nice, I don't think what >>>>we're doing here is extending the ioctl usage, we're completely >>>>changing how the base struct should be interpreted >>>based on how the VM >>>>was created (which is an entirely different ioctl). >>>> >>>>From Rusty Russel's API Design grading, >>>drm_i915_gem_execbuffer2 is >>>>already at -6 without these changes. I think after >>>vm_bind we'll need >>>>to create a -11 entry just to deal with this ioctl. >>>> >>> >>>The only change here is removing the execlist support for VM_BIND >>>mode (other than natual extensions). >>>Adding a new execbuffer3 was considered, but I think >>>we need to be careful >>>with that as that goes beyond the VM_BIND support, >>>including any future >>>requirements (as we don't want an execbuffer4 after VM_BIND). >> >>Why not? it's not like adding extensions here is really >>that different >>than adding new ioctls. >> >>I definitely think this deserves an execbuffer3 without even >>considering future requirements. Just to burn down the old >>requirements and pointless fields. >> >>Make execbuffer3 be vm bind only, no relocs, no legacy >>bits, leave the >>older sw on execbuf2 for ever. > >I guess another point in favour of execbuf3 would be that it's less >midlayer. If we share the entry point then there's quite a few vfuncs >needed to cleanly split out the vm_bind paths from the legacy >reloc/softping paths. > >If we invert this and do execbuf3, then there's the existing ioctl >vfunc, and then we share code (where it even makes sense, probably >request setup/submit need to be shared, anything else is probably >cleaner to just copypaste) with the usual helper approach. > >Also that would guarantee that really none of the old concepts like >i915_active on the vma or vma open counts and all that stuff leaks >into the new vm_bind execbuf. > >Finally I also think that copypasting would make backporting easier, >or at least more flexible, since it should make it easier to have the >upstream vm_bind co-exist with all the other things we have. Without >huge amounts of conflicts (or at least much less) that pushing a pile >of vfuncs into the existing code would cause. > >So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
Thanks for proposing this. Some comments below.
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;          /* previously execbuffer2.rsvd1 */

        __u32 batch_count;
        __u64 batch_addr_ptr;  /* Pointer to an array of batch gpu virtual addresses */

        __u64 flags;
#define I915_EXEC3_RING_MASK   (0x3f)
#define I915_EXEC3_DEFAULT     (0<<0)
#define I915_EXEC3_RENDER      (1<<0)
#define I915_EXEC3_BSD         (2<<0)
#define I915_EXEC3_BLT         (3<<0)
#define I915_EXEC3_VEBOX       (4<<0)
Shouldn't we use the new engine selection uAPI instead?
We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in drm_i915_gem_context_create_ext_setparam.
And you can also create virtual engines with the same extension.
It feels like this could be a single u32 with the engine index (in the context engine map).
Yes I said the same yesterday.
Also note that as you can't any longer set engines on a default context, question is whether userspace cares to use execbuf3 with it (default context).
If it does, it will need an alternative engine selection for that case. I was proposing class:instance rather than legacy cumbersome flags.
If it does not, I mean if the decision is to only allow execbuf3 with engine maps, then it leaves the default context a waste of kernel memory in the execbuf3 future. :( Don't know what to do there..
Regards,
Tvrtko
Thanks Tvrtko, I only saw your reply after responding.
Both Iris & Anv create a context with engines (if kernel supports it) : https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/common/intel_...
I think we should be fine with just a single engine id and we don't care about the default context.
I wonder if in this case we could stop creating the default context starting from a future "gen"? Otherwise, with engine map only execbuf3 and execbuf3 only userspace, it would serve no purpose apart from wasting kernel memory.
Thanks Tvrtko, Lionel.
I will be glad to remove these flags and just define a __u32 engine_id, mandating a context with a user engine map.
Regarding removing the default context: yeah, it depends on from which gen onwards we will only be supporting execbuf3 and execbuf2 is fully deprecated. Till then, we will have to keep it I guess :(.
Regards,
Tvrtko
-Lionel
#define I915_EXEC3_SECURE      (1<<6)
#define I915_EXEC3_IS_PINNED   (1<<7)
What's the meaning of PINNED?
This turned out to be a legacy use case. Will remove it. execbuf3 will anyway only be supported when HAS_VM_BIND is true.
#define I915_EXEC3_BSD_SHIFT   (8)
#define I915_EXEC3_BSD_MASK    (3 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_DEFAULT (0 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING1   (1 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_BSD_RING2   (2 << I915_EXEC3_BSD_SHIFT)

#define I915_EXEC3_FENCE_IN    (1<<10)
#define I915_EXEC3_FENCE_OUT   (1<<11)
For Mesa, as soon as we have DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES support, we only use that.
So there isn't much point for FENCE_IN/OUT.
Maybe check with other UMDs?
Thanks, will remove it if other UMDs do not ask for it.
#define I915_EXEC3_FENCE_SUBMIT (1<<12)
What's FENCE_SUBMIT?
This seems to be a mechanism to align request submissions together. As per Tvrtko, a generic mechanism to align submissions was rejected. So, if UMDs don't need it, we can remove it.
So, execbuf3 would look like (if all UMDS agree),
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;          /* previously execbuffer2.rsvd1 */
        __u32 engine_id;       /* previously 'execbuffer2.flags & I915_EXEC_RING_MASK' */

        __u32 rsvd1;           /* Reserved */
        __u32 batch_count;
        /* batch VA if batch_count=1, otherwise a pointer to an array of batch VAs */
        __u64 batch_address;

        __u64 flags;
#define I915_EXEC3_SECURE      (1<<0)

        __u64 rsvd2;           /* Reserved */
        __u64 extensions;      /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */
};
Also, I wondered whether we need to put the timeline fences in an extension or directly in the drm_i915_gem_execbuffer3 struct. I prefer putting them in an extension if they are not specified for all execbuf calls. Any thoughts?
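Purely as an illustration of the proposal above (drm_i915_gem_execbuffer3 does not exist yet, so the ioctl macro below is hypothetical), userspace usage could then look roughly like:

        /* Sketch against the proposed struct above; all values are placeholders. */
        __u32 ctx_id = 0;                       /* context created with an engine map */
        __u64 batch_va = 0x100000;              /* GPU VA previously bound with vm_bind */

        struct drm_i915_gem_execbuffer3 execbuf = {
                .ctx_id = ctx_id,
                .engine_id = 0,                 /* index into the context engine map */
                .batch_count = 1,
                .batch_address = batch_va,      /* single VA, no array indirection */
                .flags = 0,
                .extensions = 0,                /* e.g. timeline fences, if needed */
        };
        /* ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER3, &execbuf);  hypothetical */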
Niranjana
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
__u64 extensions; /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */ };
With this, user can pass in batch addresses and count directly, instead of as an extension (as this rfc series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Any thing else needs to be added or removed?
Niranjana
Niranjana
>-Daniel >-- >Daniel Vetter >Software Engineer, Intel Corporation >http://blog.ffwll.ch
On 08/06/2022 21:45, Niranjana Vishwanathapura wrote:
On Wed, Jun 08, 2022 at 09:54:24AM +0100, Tvrtko Ursulin wrote:
On 08/06/2022 09:45, Lionel Landwerlin wrote:
On 08/06/2022 11:36, Tvrtko Ursulin wrote:
On 08/06/2022 07:40, Lionel Landwerlin wrote:
On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote: > On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote: >> On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote: >>> >>> On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura >>> niranjana.vishwanathapura@intel.com wrote: >>>> >>>> On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote: >>>>> On Tue, 2022-05-17 at 11:32 -0700, Niranjana >>>> Vishwanathapura wrote: >>>>>> VM_BIND and related uapi definitions >>>>>> >>>>>> v2: Ensure proper kernel-doc formatting with cross references. >>>>>> Also add new uapi and documentation as per review comments >>>>>> from Daniel. >>>>>> >>>>>> Signed-off-by: Niranjana Vishwanathapura >>>> niranjana.vishwanathapura@intel.com >>>>>> --- >>>>>> Documentation/gpu/rfc/i915_vm_bind.h | 399 >>>> +++++++++++++++++++++++++++ >>>>>> 1 file changed, 399 insertions(+) >>>>>> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h >>>>>> >>>>>> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h >>>> b/Documentation/gpu/rfc/i915_vm_bind.h >>>>>> new file mode 100644 >>>>>> index 000000000000..589c0a009107 >>>>>> --- /dev/null >>>>>> +++ b/Documentation/gpu/rfc/i915_vm_bind.h >>>>>> @@ -0,0 +1,399 @@ >>>>>> +/* SPDX-License-Identifier: MIT */ >>>>>> +/* >>>>>> + * Copyright © 2022 Intel Corporation >>>>>> + */ >>>>>> + >>>>>> +/** >>>>>> + * DOC: I915_PARAM_HAS_VM_BIND >>>>>> + * >>>>>> + * VM_BIND feature availability. >>>>>> + * See typedef drm_i915_getparam_t param. >>>>>> + */ >>>>>> +#define I915_PARAM_HAS_VM_BIND 57 >>>>>> + >>>>>> +/** >>>>>> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND >>>>>> + * >>>>>> + * Flag to opt-in for VM_BIND mode of binding >>>> during VM creation. >>>>>> + * See struct drm_i915_gem_vm_control flags. >>>>>> + * >>>>>> + * A VM in VM_BIND mode will not support the older >>>> execbuff mode of binding. >>>>>> + * In VM_BIND mode, execbuff ioctl will not accept >>>> any execlist (ie., the >>>>>> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). >>>>>> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and >>>>>> + * &drm_i915_gem_execbuffer2.batch_len must be 0. >>>>>> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES >>>> extension must be provided >>>>>> + * to pass in the batch buffer addresses. >>>>>> + * >>>>>> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and >>>>>> + * I915_EXEC_BATCH_FIRST of >>>> &drm_i915_gem_execbuffer2.flags must be 0 >>>>>> + * (not used) in VM_BIND mode. >>>> I915_EXEC_USE_EXTENSIONS flag must always be >>>>>> + * set (See struct >>>> drm_i915_gem_execbuffer_ext_batch_addresses). >>>>>> + * The buffers_ptr, buffer_count, >>>> batch_start_offset and batch_len fields >>>>>> + * of struct drm_i915_gem_execbuffer2 are also not >>>> used and must be 0. >>>>>> + */ >>>>> >>>>> From that description, it seems we have: >>>>> >>>>> struct drm_i915_gem_execbuffer2 { >>>>> __u64 buffers_ptr; -> must be 0 (new) >>>>> __u32 buffer_count; -> must be 0 (new) >>>>> __u32 batch_start_offset; -> must be 0 (new) >>>>> __u32 batch_len; -> must be 0 (new) >>>>> __u32 DR1; -> must be 0 (old) >>>>> __u32 DR4; -> must be 0 (old) >>>>> __u32 num_cliprects; (fences) -> must be 0 >>>> since using extensions >>>>> __u64 cliprects_ptr; (fences, extensions) -> >>>> contains an actual pointer! 
>>>>> __u64 flags; -> some flags >>>> must be 0 (new) >>>>> __u64 rsvd1; (context info) -> repurposed field >>>>> (old) >>>>> __u64 rsvd2; -> unused >>>>> }; >>>>> >>>>> Based on that, why can't we just get >>>> drm_i915_gem_execbuffer3 instead >>>>> of adding even more complexity to an already abused >>>> interface? While >>>>> the Vulkan-like extension thing is really nice, I don't think >>>>> what >>>>> we're doing here is extending the ioctl usage, we're completely >>>>> changing how the base struct should be interpreted >>>> based on how the VM >>>>> was created (which is an entirely different ioctl). >>>>> >>>>> From Rusty Russel's API Design grading, >>>> drm_i915_gem_execbuffer2 is >>>>> already at -6 without these changes. I think after >>>> vm_bind we'll need >>>>> to create a -11 entry just to deal with this ioctl. >>>>> >>>> >>>> The only change here is removing the execlist support for VM_BIND >>>> mode (other than natual extensions). >>>> Adding a new execbuffer3 was considered, but I think we need >>>> to be careful >>>> with that as that goes beyond the VM_BIND support, including >>>> any future >>>> requirements (as we don't want an execbuffer4 after VM_BIND). >>> >>> Why not? it's not like adding extensions here is really that >>> different >>> than adding new ioctls. >>> >>> I definitely think this deserves an execbuffer3 without even >>> considering future requirements. Just to burn down the old >>> requirements and pointless fields. >>> >>> Make execbuffer3 be vm bind only, no relocs, no legacy bits, >>> leave the >>> older sw on execbuf2 for ever. >> >> I guess another point in favour of execbuf3 would be that it's less >> midlayer. If we share the entry point then there's quite a few >> vfuncs >> needed to cleanly split out the vm_bind paths from the legacy >> reloc/softping paths. >> >> If we invert this and do execbuf3, then there's the existing ioctl >> vfunc, and then we share code (where it even makes sense, probably >> request setup/submit need to be shared, anything else is probably >> cleaner to just copypaste) with the usual helper approach. >> >> Also that would guarantee that really none of the old concepts like >> i915_active on the vma or vma open counts and all that stuff leaks >> into the new vm_bind execbuf. >> >> Finally I also think that copypasting would make backporting >> easier, >> or at least more flexible, since it should make it easier to >> have the >> upstream vm_bind co-exist with all the other things we have. >> Without >> huge amounts of conflicts (or at least much less) that pushing a >> pile >> of vfuncs into the existing code would cause. >> >> So maybe we should do this? > > Thanks Dave, Daniel. > There are a few things that will be common between execbuf2 and > execbuf3, like request setup/submit (as you said), fence handling > (timeline fences, fence array, composite fences), engine selection, > etc. Also, many of the 'flags' will be there in execbuf3 also (but > bit position will differ). > But I guess these should be fine as the suggestion here is to > copy-paste the execbuff code and having a shared code where > possible. > Besides, we can stop supporting some older feature in execbuff3 > (like fence array in favor of newer timeline fences), which will > further reduce common code. > > Ok, I will update this series by adding execbuf3 and send out soon. >
Does this sound reasonable?
Thanks for proposing this. Some comments below.
struct drm_i915_gem_execbuffer3 { __u32 ctx_id; /* previously execbuffer2.rsvd1 */
__u32 batch_count; __u64 batch_addr_ptr; /* Pointer to an array of batch gpu virtual addresses */
__u64 flags; #define I915_EXEC3_RING_MASK (0x3f) #define I915_EXEC3_DEFAULT (0<<0) #define I915_EXEC3_RENDER (1<<0) #define I915_EXEC3_BSD (2<<0) #define I915_EXEC3_BLT (3<<0) #define I915_EXEC3_VEBOX (4<<0)
Shouldn't we use the new engine selection uAPI instead?
We can already create an engine map with I915_CONTEXT_PARAM_ENGINES in drm_i915_gem_context_create_ext_setparam.
And you can also create virtual engines with the same extension.
It feels like this could be a single u32 with the engine index (in the context engine map).
Yes I said the same yesterday.
Also note that as you can't any longer set engines on a default context, question is whether userspace cares to use execbuf3 with it (default context).
If it does, it will need an alternative engine selection for that case. I was proposing class:instance rather than legacy cumbersome flags.
If it does not, I mean if the decision is to only allow execbuf3 with engine maps, then it leaves the default context a waste of kernel memory in the execbuf3 future. :( Don't know what to do there..
Regards,
Tvrtko
Thanks Tvrtko, I only saw your reply after responding.
Both Iris & Anv create a context with engines (if kernel supports it) : https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/common/intel_...
I think we should be fine with just a single engine id and we don't care about the default context.
I wonder if in this case we could stop creating the default context starting from a future "gen"? Otherwise, with engine map only execbuf3 and execbuf3 only userspace, it would serve no purpose apart from wasting kernel memory.
Thanks Tvrtko, Lionel.
I will be glad to remove these flags, just define a uint32 engine_id and mandate a context with user engines map.
Regarding removing the default context, yah, it depends on from which gen onwards we will only be supporting execbuf3 and execbuf2 is fully deprecated. Till then, we will have to keep it I guess :(.
Forgot about this sub-thread... I think it could be removed before execbuf2 is fully deprecated. We can make that decision with any new platform which needs UMD stack updates to be supported. But it is work for us to adjust IGT, so I am not hopeful anyone will tackle it. We will just end up wasting memory.
Regards,
Tvrtko
On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote:
On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote: > VM_BIND and related uapi definitions > > v2: Ensure proper kernel-doc formatting with cross references. > Also add new uapi and documentation as per review comments > from Daniel. > > Signed-off-by: Niranjana Vishwanathapura
niranjana.vishwanathapura@intel.com
> --- > Documentation/gpu/rfc/i915_vm_bind.h | 399
+++++++++++++++++++++++++++
> 1 file changed, 399 insertions(+) > create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h > > diff --git a/Documentation/gpu/rfc/i915_vm_bind.h
b/Documentation/gpu/rfc/i915_vm_bind.h
> new file mode 100644 > index 000000000000..589c0a009107 > --- /dev/null > +++ b/Documentation/gpu/rfc/i915_vm_bind.h > @@ -0,0 +1,399 @@ > +/* SPDX-License-Identifier: MIT */ > +/* > + * Copyright © 2022 Intel Corporation > + */ > + > +/** > + * DOC: I915_PARAM_HAS_VM_BIND > + * > + * VM_BIND feature availability. > + * See typedef drm_i915_getparam_t param. > + */ > +#define I915_PARAM_HAS_VM_BIND 57 > + > +/** > + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND > + * > + * Flag to opt-in for VM_BIND mode of binding during VM creation. > + * See struct drm_i915_gem_vm_control flags. > + * > + * A VM in VM_BIND mode will not support the older execbuff
mode of binding.
> + * In VM_BIND mode, execbuff ioctl will not accept any
execlist (ie., the
> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). > + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and > + * &drm_i915_gem_execbuffer2.batch_len must be 0. > + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must
be provided
> + * to pass in the batch buffer addresses. > + * > + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and > + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags
must be 0
> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag
must always be
> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses). > + * The buffers_ptr, buffer_count, batch_start_offset and
batch_len fields
> + * of struct drm_i915_gem_execbuffer2 are also not used and
must be 0.
> + */
From that description, it seems we have:
struct drm_i915_gem_execbuffer2 { __u64 buffers_ptr; -> must be 0 (new) __u32 buffer_count; -> must be 0 (new) __u32 batch_start_offset; -> must be 0 (new) __u32 batch_len; -> must be 0 (new) __u32 DR1; -> must be 0 (old) __u32 DR4; -> must be 0 (old) __u32 num_cliprects; (fences) -> must be 0 since using
extensions
__u64 cliprects_ptr; (fences, extensions) -> contains an
actual pointer!
__u64 flags; -> some flags must be 0
(new)
__u64 rsvd1; (context info) -> repurposed field (old) __u64 rsvd2; -> unused };
Based on that, why can't we just get drm_i915_gem_execbuffer3
instead
of adding even more complexity to an already abused interface? While the Vulkan-like extension thing is really nice, I don't think what we're doing here is extending the ioctl usage, we're completely changing how the base struct should be interpreted based on how
the VM
was created (which is an entirely different ioctl).
From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is already at -6 without these changes. I think after vm_bind we'll
need
to create a -11 entry just to deal with this ioctl.
The only change here is removing the execlist support for VM_BIND mode (other than natual extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;          /* previously execbuffer2.rsvd1 */

        __u32 batch_count;
        __u64 batch_addr_ptr;  /* Pointer to an array of batch gpu virtual addresses */
Quick question raised on IRC about the batches: Are multiple batches limited to virtual engines?
Thanks,
-Lionel
__u64 flags; #define I915_EXEC3_RING_MASK (0x3f) #define I915_EXEC3_DEFAULT (0<<0) #define I915_EXEC3_RENDER (1<<0) #define I915_EXEC3_BSD (2<<0) #define I915_EXEC3_BLT (3<<0) #define I915_EXEC3_VEBOX (4<<0)
#define I915_EXEC3_SECURE (1<<6) #define I915_EXEC3_IS_PINNED (1<<7)
#define I915_EXEC3_BSD_SHIFT (8) #define I915_EXEC3_BSD_MASK (3 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_DEFAULT (0 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING1 (1 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING2 (2 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_FENCE_IN (1<<10) #define I915_EXEC3_FENCE_OUT (1<<11) #define I915_EXEC3_FENCE_SUBMIT (1<<12)
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
__u64 extensions; /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */ };
With this, user can pass in batch addresses and count directly, instead of as an extension (as this rfc series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Any thing else needs to be added or removed?
Niranjana
Niranjana
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Jun 08, 2022 at 10:12:45AM +0300, Lionel Landwerlin wrote:
On 03/06/2022 09:53, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 10:08:35PM -0700, Niranjana Vishwanathapura wrote:
On Wed, Jun 01, 2022 at 11:27:17AM +0200, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 11:03, Dave Airlie airlied@gmail.com wrote:
On Tue, 24 May 2022 at 05:20, Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
On Thu, May 19, 2022 at 04:07:30PM -0700, Zanoni, Paulo R wrote: >On Tue, 2022-05-17 at 11:32 -0700, Niranjana Vishwanathapura wrote: >> VM_BIND and related uapi definitions >> >> v2: Ensure proper kernel-doc formatting with cross references. >> Also add new uapi and documentation as per review comments >> from Daniel. >> >> Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com >> --- >> Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ >> 1 file changed, 399 insertions(+) >> create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h >> >> diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h >> new file mode 100644 >> index 000000000000..589c0a009107 >> --- /dev/null >> +++ b/Documentation/gpu/rfc/i915_vm_bind.h >> @@ -0,0 +1,399 @@ >> +/* SPDX-License-Identifier: MIT */ >> +/* >> + * Copyright © 2022 Intel Corporation >> + */ >> + >> +/** >> + * DOC: I915_PARAM_HAS_VM_BIND >> + * >> + * VM_BIND feature availability. >> + * See typedef drm_i915_getparam_t param. >> + */ >> +#define I915_PARAM_HAS_VM_BIND 57 >> + >> +/** >> + * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND >> + * >> + * Flag to opt-in for VM_BIND mode of binding during VM creation. >> + * See struct drm_i915_gem_vm_control flags. >> + * >> + * A VM in VM_BIND mode will not support the older execbuff mode of binding. >> + * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the >> + * &drm_i915_gem_execbuffer2.buffer_count must be 0). >> + * Also, &drm_i915_gem_execbuffer2.batch_start_offset and >> + * &drm_i915_gem_execbuffer2.batch_len must be 0. >> + * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided >> + * to pass in the batch buffer addresses. >> + * >> + * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and >> + * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0 >> + * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be >> + * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses). >> + * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields >> + * of struct drm_i915_gem_execbuffer2 are also not used and must be 0. >> + */ > >From that description, it seems we have: > >struct drm_i915_gem_execbuffer2 { > __u64 buffers_ptr; -> must be 0 (new) > __u32 buffer_count; -> must be 0 (new) > __u32 batch_start_offset; -> must be 0 (new) > __u32 batch_len; -> must be 0 (new) > __u32 DR1; -> must be 0 (old) > __u32 DR4; -> must be 0 (old) > __u32 num_cliprects; (fences) -> must be 0 since using extensions > __u64 cliprects_ptr; (fences, extensions) -> contains an actual pointer! > __u64 flags; -> some flags must be 0 (new) > __u64 rsvd1; (context info) -> repurposed field (old) > __u64 rsvd2; -> unused >}; > >Based on that, why can't we just get drm_i915_gem_execbuffer3 instead >of adding even more complexity to an already abused interface? While >the Vulkan-like extension thing is really nice, I don't think what >we're doing here is extending the ioctl usage, we're completely >changing how the base struct should be interpreted based on how the VM >was created (which is an entirely different ioctl). > >From Rusty Russel's API Design grading, drm_i915_gem_execbuffer2 is >already at -6 without these changes. I think after vm_bind we'll need >to create a -11 entry just to deal with this ioctl. >
The only change here is removing the execlist support for VM_BIND mode (other than natual extensions). Adding a new execbuffer3 was considered, but I think we need to be careful with that as that goes beyond the VM_BIND support, including any future requirements (as we don't want an execbuffer4 after VM_BIND).
Why not? it's not like adding extensions here is really that different than adding new ioctls.
I definitely think this deserves an execbuffer3 without even considering future requirements. Just to burn down the old requirements and pointless fields.
Make execbuffer3 be vm bind only, no relocs, no legacy bits, leave the older sw on execbuf2 for ever.
I guess another point in favour of execbuf3 would be that it's less midlayer. If we share the entry point then there's quite a few vfuncs needed to cleanly split out the vm_bind paths from the legacy reloc/softping paths.
If we invert this and do execbuf3, then there's the existing ioctl vfunc, and then we share code (where it even makes sense, probably request setup/submit need to be shared, anything else is probably cleaner to just copypaste) with the usual helper approach.
Also that would guarantee that really none of the old concepts like i915_active on the vma or vma open counts and all that stuff leaks into the new vm_bind execbuf.
Finally I also think that copypasting would make backporting easier, or at least more flexible, since it should make it easier to have the upstream vm_bind co-exist with all the other things we have. Without huge amounts of conflicts (or at least much less) that pushing a pile of vfuncs into the existing code would cause.
So maybe we should do this?
Thanks Dave, Daniel. There are a few things that will be common between execbuf2 and execbuf3, like request setup/submit (as you said), fence handling (timeline fences, fence array, composite fences), engine selection, etc. Also, many of the 'flags' will be there in execbuf3 also (but bit position will differ). But I guess these should be fine as the suggestion here is to copy-paste the execbuff code and having a shared code where possible. Besides, we can stop supporting some older feature in execbuff3 (like fence array in favor of newer timeline fences), which will further reduce common code.
Ok, I will update this series by adding execbuf3 and send out soon.
Does this sound reasonable?
struct drm_i915_gem_execbuffer3 {
        __u32 ctx_id;          /* previously execbuffer2.rsvd1 */

        __u32 batch_count;
        __u64 batch_addr_ptr;  /* Pointer to an array of batch gpu virtual addresses */
Quick question raised on IRC about the batches: Are multiple batches limited to virtual engines?
Parallel engines, see i915_context_engines_parallel_submit in i915_drm.h.
Currently the media UMD uses this uAPI to do split-frame decoding (e.g. run multiple batches in parallel on the video engines to decode an 8k frame).
Of course there could be future users of this uAPI too.
Matt
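For reference, a sketch of how such a parallel engine is described with the existing i915_context_engines_parallel_submit extension (the class:instance pairs and the map slot are placeholders); it chains off the engine map at context creation, after which a single submission carries 'width' batches:

        /* Sketch: a 2-wide parallel engine on two video (BSD) engine instances,
         * placed in slot 0 of the context engine map. */
        I915_DEFINE_CONTEXT_ENGINES_PARALLEL_SUBMIT(parallel, 2) = {
                .base = { .name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT },
                .engine_index = 0,      /* engine map slot occupied by the parallel engine */
                .width = 2,             /* number of batches submitted together */
                .num_siblings = 1,      /* physical engines per logical slot */
                .engines = {
                        { I915_ENGINE_CLASS_VIDEO, 0 },
                        { I915_ENGINE_CLASS_VIDEO, 1 },
                },
        };
        /* 'parallel' is then chained into i915_context_param_engines.extensions
         * when creating the context with I915_CONTEXT_PARAM_ENGINES. */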
Thanks,
-Lionel
__u64 flags; #define I915_EXEC3_RING_MASK (0x3f) #define I915_EXEC3_DEFAULT (0<<0) #define I915_EXEC3_RENDER (1<<0) #define I915_EXEC3_BSD (2<<0) #define I915_EXEC3_BLT (3<<0) #define I915_EXEC3_VEBOX (4<<0)
#define I915_EXEC3_SECURE (1<<6) #define I915_EXEC3_IS_PINNED (1<<7)
#define I915_EXEC3_BSD_SHIFT (8) #define I915_EXEC3_BSD_MASK (3 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_DEFAULT (0 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING1 (1 << I915_EXEC3_BSD_SHIFT) #define I915_EXEC3_BSD_RING2 (2 << I915_EXEC3_BSD_SHIFT)
#define I915_EXEC3_FENCE_IN (1<<10) #define I915_EXEC3_FENCE_OUT (1<<11) #define I915_EXEC3_FENCE_SUBMIT (1<<12)
__u64 in_out_fence; /* previously execbuffer2.rsvd2 */
__u64 extensions; /* currently only for DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES */ };
With this, user can pass in batch addresses and count directly, instead of as an extension (as this rfc series was proposing).
I have removed many of the flags which were either legacy or not applicable to VM_BIND mode. I have also removed fence array support (execbuffer2.cliprects_ptr) as we have timeline fence array support. Is that fine? Do we still need FENCE_IN/FENCE_OUT/FENCE_SUBMIT support?
Any thing else needs to be added or removed?
Niranjana
Niranjana
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0)
+/**
- DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
- Flag to declare context as long running.
- See struct drm_i915_gem_context_create_ext flags.
- Usage of dma-fence expects that they complete in reasonable amount of time.
- Compute on the other hand can be long running. Hence it is not appropriate
- for compute contexts to export request completion dma-fence to user.
- The dma-fence usage will be limited to in-kernel consumption only.
- Compute contexts need to use user/memory fence.
- So, long running contexts do not support output fences. Hence,
- I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
- I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
- to be not used.
- DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
- to long running contexts.
- */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
+/* VM_BIND related ioctls */ +#define DRM_I915_GEM_VM_BIND 0x3d +#define DRM_I915_GEM_VM_UNBIND 0x3e +#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f
+#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+/**
- struct drm_i915_gem_vm_bind - VA to object mapping to bind.
- This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
- virtual address (VA) range to the section of an object that should be bound
- in the device page table of the specified address space (VM).
- The VA range specified must be unique (ie., not currently bound) and can
- be mapped to whole object or a section of the object (partial binding).
- Multiple VA mappings can be created to the same section of the object
- (aliasing).
- */
+struct drm_i915_gem_vm_bind {
- /** @vm_id: VM (address space) id to bind */
- __u32 vm_id;
- /** @handle: Object handle */
- __u32 handle;
- /** @start: Virtual Address start to bind */
- __u64 start;
- /** @offset: Offset in object to bind */
- __u64 offset;
- /** @length: Length of mapping to bind */
- __u64 length;
Does it support, or should it, an equivalent of EXEC_OBJECT_PAD_TO_SIZE? Or if not, is userspace expected to map the remainder of the space to a dummy object? In which case, would there be any alignment/padding issues preventing the two binds from being placed next to each other?
I ask because someone from the compute side asked me about a problem with their strategy of dealing with overfetch and I suggested pad to size.
Regards,
Tvrtko
- /**
* @flags: Supported flags are,
*
* I915_GEM_VM_BIND_READONLY:
* Mapping is read-only.
*
* I915_GEM_VM_BIND_CAPTURE:
* Capture this mapping in the dump upon GPU error.
*/
- __u64 flags;
+#define I915_GEM_VM_BIND_READONLY (1 << 0) +#define I915_GEM_VM_BIND_CAPTURE (1 << 1)
- /** @extensions: 0-terminated chain of extensions for this mapping. */
- __u64 extensions;
+};
+/**
- struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
- This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
- address (VA) range that should be unbound from the device page table of the
- specified address space (VM). The specified VA range must match one of the
- mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
- completion.
- */
+struct drm_i915_gem_vm_unbind {
- /** @vm_id: VM (address space) id to bind */
- __u32 vm_id;
- /** @rsvd: Reserved for future use; must be zero. */
- __u32 rsvd;
- /** @start: Virtual Address start to unbind */
- __u64 start;
- /** @length: Length of mapping to unbind */
- __u64 length;
- /** @flags: reserved for future usage, currently MBZ */
- __u64 flags;
- /** @extensions: 0-terminated chain of extensions for this mapping. */
- __u64 extensions;
+};
+/**
- struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
- or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence to signal
- before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the returned output fence
- after the completion of binding or unbinding.
- */
+struct drm_i915_vm_bind_fence {
- /** @handle: User's handle for a drm_syncobj to wait on or signal. */
- __u32 handle;
- /**
* @flags: Supported flags are,
*
* I915_VM_BIND_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
- __u32 flags;
+#define I915_VM_BIND_FENCE_WAIT (1<<0) +#define I915_VM_BIND_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1)) +};
+/**
- struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
- and vm_unbind.
- This structure describes an array of timeline drm_syncobj and associated
- points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
- can be input or output fences (See struct drm_i915_vm_bind_fence).
- */
+struct drm_i915_vm_bind_ext_timeline_fences { +#define I915_VM_BIND_EXT_timeline_FENCES 0
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /**
* @fence_count: Number of elements in the @handles_ptr & @value_ptr
* arrays.
*/
- __u64 fence_count;
- /**
* @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
* of length @fence_count.
*/
- __u64 handles_ptr;
- /**
* @values_ptr: Pointer to an array of u64 values of length
* @fence_count.
* Values must be 0 for a binary drm_syncobj. A Value of 0 for a
* timeline drm_syncobj is invalid as it turns a drm_syncobj into a
* binary one.
*/
- __u64 values_ptr;
+};
+/**
- struct drm_i915_vm_bind_user_fence - An input or output user fence for the
- vm_bind or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence (value at
- @addr to become equal to @val) before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the output fence after
- the completion of binding or unbinding by writing @val to memory location at
- @addr
- */
+struct drm_i915_vm_bind_user_fence {
- /** @addr: User/Memory fence qword aligned process virtual address */
- __u64 addr;
- /** @val: User/Memory fence value to be written after bind completion */
- __u64 val;
- /**
* @flags: Supported flags are,
*
* I915_VM_BIND_USER_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_USER_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
- __u32 flags;
+#define I915_VM_BIND_USER_FENCE_WAIT (1<<0) +#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
- (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
+};
+/**
- struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
- and vm_unbind.
- These user fences can be input or output fences
- (See struct drm_i915_vm_bind_user_fence).
- */
+struct drm_i915_vm_bind_ext_user_fence { +#define I915_VM_BIND_EXT_USER_FENCES 1
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @fence_count: Number of elements in the @user_fence_ptr array. */
- __u64 fence_count;
- /**
* @user_fence_ptr: Pointer to an array of
* struct drm_i915_vm_bind_user_fence of length @fence_count.
*/
- __u64 user_fence_ptr;
+};
+/**
- struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
- gpu virtual addresses.
- In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
- must always be appended in the VM_BIND mode and it will be an error to
- append this extension in older non-VM_BIND mode.
- */
+struct drm_i915_gem_execbuffer_ext_batch_addresses { +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @count: Number of addresses in the addr array. */
- __u32 count;
- /** @addr: An array of batch gpu virtual addresses. */
- __u64 addr[0];
+};
+/**
- struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
- signaling extension.
- This extension allows user to attach a user fence (@addr, @value pair) to an
- execbuf to be signaled by the command streamer after the completion of first
- level batch, by writing the @value at specified @addr and triggering an
- interrupt.
- User can either poll for this user fence to signal or can also wait on it
- with i915_gem_wait_user_fence ioctl.
- This is very much useful for long running contexts where waiting on dma-fence
- by user (like i915_gem_wait ioctl) is not supported.
- */
+struct drm_i915_gem_execbuffer_ext_user_fence { +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /**
* @addr: User/Memory fence qword aligned GPU virtual address.
*
* Address has to be a valid GPU virtual address at the time of
* first level batch completion.
*/
- __u64 addr;
- /**
* @value: User/Memory fence Value to be written to above address
* after first level batch completes.
*/
- __u64 value;
- /** @rsvd: Reserved for future extensions, MBZ */
- __u64 rsvd;
+};
+/**
- struct drm_i915_gem_create_ext_vm_private - Extension to make the object
- private to the specified VM.
- See struct drm_i915_gem_create_ext.
- */
+struct drm_i915_gem_create_ext_vm_private { +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @vm_id: Id of the VM to which the object is private */
- __u32 vm_id;
+};
+/**
- struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
- User/Memory fence can be woken up either by:
- GPU context indicated by @ctx_id, or,
- Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
- @ctx_id is ignored when this flag is set.
- Wakeup condition is,
- ``((*addr & mask) op (value & mask))``
- See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
- */
+struct drm_i915_gem_wait_user_fence {
- /** @extensions: Zero-terminated chain of extensions. */
- __u64 extensions;
- /** @addr: User/Memory fence address */
- __u64 addr;
- /** @ctx_id: Id of the Context which will signal the fence. */
- __u32 ctx_id;
- /** @op: Wakeup condition operator */
- __u16 op;
+#define I915_UFENCE_WAIT_EQ 0 +#define I915_UFENCE_WAIT_NEQ 1 +#define I915_UFENCE_WAIT_GT 2 +#define I915_UFENCE_WAIT_GTE 3 +#define I915_UFENCE_WAIT_LT 4 +#define I915_UFENCE_WAIT_LTE 5 +#define I915_UFENCE_WAIT_BEFORE 6 +#define I915_UFENCE_WAIT_AFTER 7
- /**
* @flags: Supported flags are,
*
* I915_UFENCE_WAIT_SOFT:
*
* To be woken up by i915 driver async worker (not by GPU).
*
* I915_UFENCE_WAIT_ABSTIME:
*
* Wait timeout specified as absolute time.
*/
- __u16 flags;
+#define I915_UFENCE_WAIT_SOFT 0x1 +#define I915_UFENCE_WAIT_ABSTIME 0x2
- /** @value: Wakeup value */
- __u64 value;
- /** @mask: Wakeup mask */
- __u64 mask;
+#define I915_UFENCE_WAIT_U8 0xffu +#define I915_UFENCE_WAIT_U16 0xffffu +#define I915_UFENCE_WAIT_U32 0xfffffffful +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull
- /**
* @timeout: Wait timeout in nanoseconds.
*
* If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is the
* absolute time in nsec.
*/
- __s64 timeout;
+};
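As an aside, a sketch of how the proposed wait ioctl above would be exercised from userspace (RFC-only uapi from the header above; the address, context id and values are placeholders):

        /* Sketch against the proposed (RFC) uapi above: wait until the 64-bit user
         * fence at 'fence_addr' reaches at least 'val', woken by the GPU context
         * that is expected to write it. */
        __u64 fence_addr = 0;                    /* qword-aligned user fence address */
        __u64 val = 1;
        __u32 ctx_id = 0;

        struct drm_i915_gem_wait_user_fence wait = {
                .extensions = 0,
                .addr = fence_addr,
                .ctx_id = ctx_id,                /* context expected to signal it */
                .op = I915_UFENCE_WAIT_GTE,      /* (*addr & mask) >= (value & mask) */
                .flags = 0,
                .value = val,
                .mask = I915_UFENCE_WAIT_U64,
                .timeout = 1000000000,           /* 1s, relative since ABSTIME is not set */
        };
        /* ioctl(drm_fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);  RFC ioctl */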
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0)
+/**
- DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
- Flag to declare context as long running.
- See struct drm_i915_gem_context_create_ext flags.
- Usage of dma-fence expects that they complete in reasonable amount of time.
- Compute on the other hand can be long running. Hence it is not appropriate
- for compute contexts to export request completion dma-fence to user.
- The dma-fence usage will be limited to in-kernel consumption only.
- Compute contexts need to use user/memory fence.
- So, long running contexts do not support output fences. Hence,
- I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
- I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
- to be not used.
- DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
- to long running contexts.
- */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
+/* VM_BIND related ioctls */ +#define DRM_I915_GEM_VM_BIND 0x3d +#define DRM_I915_GEM_VM_UNBIND 0x3e +#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f
+#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+/**
- struct drm_i915_gem_vm_bind - VA to object mapping to bind.
- This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
- virtual address (VA) range to the section of an object that should be bound
- in the device page table of the specified address space (VM).
- The VA range specified must be unique (i.e., not currently bound) and can
- be mapped to the whole object or to a section of the object (partial binding).
- Multiple VA mappings can be created to the same section of the object
- (aliasing).
- */
+struct drm_i915_gem_vm_bind {
- /** @vm_id: VM (address space) id to bind */
- __u32 vm_id;
- /** @handle: Object handle */
- __u32 handle;
- /** @start: Virtual Address start to bind */
- __u64 start;
- /** @offset: Offset in object to bind */
- __u64 offset;
- /** @length: Length of mapping to bind */
- __u64 length;
Does it support, or should it, an equivalent of EXEC_OBJECT_PAD_TO_SIZE? Or, if not, is userspace expected to map the remainder of the space to a dummy object? In that case, would there be any alignment/padding issues preventing the two binds from being placed next to each other?
I ask because someone from the compute side asked me about a problem with their strategy of dealing with overfetch, and I suggested pad to size.
Thanks Tvrtko, I don't think we should be needing it. As VA assignment is completely pushed to userspace with VM_BIND, no padding should be necessary once the 'start' and 'size' alignment conditions are met.
I will add some documentation on the alignment requirements here. Generally, 'start' and 'size' should be 4K aligned. But, I think when we have 64K lmem page sizes (dg2 and xehpsdv), they need to be 64K aligned.
Niranjana
Regards,
Tvrtko
- /**
* @flags: Supported flags are,
*
* I915_GEM_VM_BIND_READONLY:
* Mapping is read-only.
*
* I915_GEM_VM_BIND_CAPTURE:
* Capture this mapping in the dump upon GPU error.
*/
- __u64 flags;
+#define I915_GEM_VM_BIND_READONLY (1 << 0) +#define I915_GEM_VM_BIND_CAPTURE (1 << 1)
- /** @extensions: 0-terminated chain of extensions for this mapping. */
- __u64 extensions;
+};
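A sketch of a single bind using the ioctl proposed above. vm_id, bo_handle and the VA are placeholders picked by the UMD; per the alignment discussion above, start and length are kept 64K aligned here so the example is also safe on platforms with 64K lmem pages.

struct drm_i915_gem_vm_bind bind = {
	.vm_id  = vm_id,
	.handle = bo_handle,
	.start  = 0x100000000ull,		/* VA chosen by the UMD's allocator */
	.offset = 0,				/* bind from the start of the BO */
	.length = 64 * 1024,			/* partial binding of the first 64K */
	.flags  = I915_GEM_VM_BIND_CAPTURE,	/* include in GPU error capture */
};

int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);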
+/**
- struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
- This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
- address (VA) range that should be unbound from the device page table of the
- specified address space (VM). The specified VA range must match one of the
- mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
- completion.
- */
+struct drm_i915_gem_vm_unbind {
- /** @vm_id: VM (address space) id to bind */
- __u32 vm_id;
- /** @rsvd: Reserved for future use; must be zero. */
- __u32 rsvd;
- /** @start: Virtual Address start to unbind */
- __u64 start;
- /** @length: Length of mapping to unbind */
- __u64 length;
- /** @flags: reserved for future usage, currently MBZ */
- __u64 flags;
- /** @extensions: 0-terminated chain of extensions for this mapping. */
- __u64 extensions;
+};
+/**
- struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
- or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence to signal
- before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the returned output fence
- after the completion of binding or unbinding.
- */
+struct drm_i915_vm_bind_fence {
- /** @handle: User's handle for a drm_syncobj to wait on or signal. */
- __u32 handle;
- /**
* @flags: Supported flags are,
*
* I915_VM_BIND_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
- __u32 flags;
+#define I915_VM_BIND_FENCE_WAIT (1<<0) +#define I915_VM_BIND_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1)) +};
+/**
- struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
- and vm_unbind.
- This structure describes an array of timeline drm_syncobj and associated
- points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
- can be input or output fences (See struct drm_i915_vm_bind_fence).
- */
+struct drm_i915_vm_bind_ext_timeline_fences { +#define I915_VM_BIND_EXT_timeline_FENCES 0
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /**
* @fence_count: Number of elements in the @handles_ptr & @values_ptr
* arrays.
*/
- __u64 fence_count;
- /**
* @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
* of length @fence_count.
*/
- __u64 handles_ptr;
- /**
* @values_ptr: Pointer to an array of u64 values of length
* @fence_count.
* Values must be 0 for a binary drm_syncobj. A Value of 0 for a
* timeline drm_syncobj is invalid as it turns a drm_syncobj into a
* binary one.
*/
- __u64 values_ptr;
+};
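For a non-compute context, a completion fence for the bind sketched earlier could be requested through this extension roughly as follows; syncobj_handle and the timeline point are placeholders, and the extension name is spelled exactly as in this RFC.

struct drm_i915_vm_bind_fence fence = {
	.handle = syncobj_handle,
	.flags  = I915_VM_BIND_FENCE_SIGNAL,	/* signal on bind completion */
};
__u64 point = 1;				/* timeline point to signal */

struct drm_i915_vm_bind_ext_timeline_fences timeline = {
	.base.name   = I915_VM_BIND_EXT_timeline_FENCES,
	.fence_count = 1,
	.handles_ptr = (__u64)(uintptr_t)&fence,
	.values_ptr  = (__u64)(uintptr_t)&point,
};

bind.extensions = (__u64)(uintptr_t)&timeline;	/* chain onto the vm_bind above */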
+/**
- struct drm_i915_vm_bind_user_fence - An input or output user fence for the
- vm_bind or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence (value at
- @addr to become equal to @val) before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the output fence after
- the completion of binding or unbinding by writing @val to memory location at
- @addr.
- */
+struct drm_i915_vm_bind_user_fence {
- /** @addr: User/Memory fence qword aligned process virtual address */
- __u64 addr;
- /** @val: User/Memory fence value to be written after bind completion */
- __u64 val;
- /**
* @flags: Supported flags are,
*
* I915_VM_BIND_USER_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_USER_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
- __u32 flags;
+#define I915_VM_BIND_USER_FENCE_WAIT (1<<0) +#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
- (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
+};
+/**
- struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
- and vm_unbind.
- These user fences can be input or output fences
- (See struct drm_i915_vm_bind_user_fence).
- */
+struct drm_i915_vm_bind_ext_user_fence { +#define I915_VM_BIND_EXT_USER_FENCES 1
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @fence_count: Number of elements in the @user_fence_ptr array. */
- __u64 fence_count;
- /**
* @user_fence_ptr: Pointer to an array of
* struct drm_i915_vm_bind_user_fence of length @fence_count.
*/
- __u64 user_fence_ptr;
+};
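Long running (compute) contexts would use the same chaining but with a user/memory fence instead of a syncobj; in this sketch fence_word is an illustrative qword in process memory that the UMD later polls or waits on.

uint64_t fence_word = 0;	/* written by the async worker on completion */

struct drm_i915_vm_bind_user_fence ufence = {
	.addr  = (__u64)(uintptr_t)&fence_word,
	.val   = 1,
	.flags = I915_VM_BIND_USER_FENCE_SIGNAL,
};

struct drm_i915_vm_bind_ext_user_fence ext = {
	.base.name      = I915_VM_BIND_EXT_USER_FENCES,
	.fence_count    = 1,
	.user_fence_ptr = (__u64)(uintptr_t)&ufence,
};

bind.extensions = (__u64)(uintptr_t)&ext;	/* chain onto the vm_bind above */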
+/**
- struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
- gpu virtual addresses.
- In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
- must always be appended in the VM_BIND mode and it will be an error to
- append this extension in older non-VM_BIND mode.
- */
+struct drm_i915_gem_execbuffer_ext_batch_addresses { +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @count: Number of addresses in the addr array. */
- __u32 count;
- /** @addr: An array of batch gpu virtual addresses. */
- __u64 addr[0];
+};
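A rough sketch of an execbuff submission in VM_BIND mode follows. It relies on the existing I915_EXEC_USE_EXTENSIONS mechanism, where cliprects_ptr is reused to carry the extension chain, together with the extension proposed here; ctx_id and batch_gpu_va are placeholders and engine selection flags are omitted.

struct drm_i915_gem_execbuffer_ext_batch_addresses *ext =
	calloc(1, sizeof(*ext) + sizeof(__u64));

ext->base.name = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES;
ext->count     = 1;
ext->addr[0]   = batch_gpu_va;			/* VA previously bound with VM_BIND */

struct drm_i915_gem_execbuffer2 execbuf = {
	.buffer_count  = 0,			/* no execlist in VM_BIND mode */
	.flags         = I915_EXEC_USE_EXTENSIONS,
	.cliprects_ptr = (__u64)(uintptr_t)ext,	/* extension chain */
	.rsvd1         = ctx_id,		/* context id */
};

int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);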
+/**
- struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
- signaling extension.
- This extension allows the user to attach a user fence (@addr, @value pair) to an
- execbuf, to be signaled by the command streamer after the completion of the first
- level batch, by writing the @value at the specified @addr and triggering an
- interrupt.
- The user can either poll for this user fence to signal or wait on it
- with the i915_gem_wait_user_fence ioctl.
- This is particularly useful for long running contexts, where waiting on a dma-fence
- by the user (like the i915_gem_wait ioctl) is not supported.
- */
+struct drm_i915_gem_execbuffer_ext_user_fence { +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /**
* @addr: User/Memory fence qword aligned GPU virtual address.
*
* Address has to be a valid GPU virtual address at the time of
* first level batch completion.
*/
- __u64 addr;
- /**
* @value: User/Memory fence Value to be written to above address
* after first level batch completes.
*/
- __u64 value;
- /** @rsvd: Reserved for future extensions, MBZ */
- __u64 rsvd;
+};
+/**
- struct drm_i915_gem_create_ext_vm_private - Extension to make the object
- private to the specified VM.
- See struct drm_i915_gem_create_ext.
- */
+struct drm_i915_gem_create_ext_vm_private { +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @vm_id: Id of the VM to which the object is private */
- __u32 vm_id;
+};
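Pairing this with the existing GEM_CREATE_EXT ioctl, creating an object that is private to one VM would look roughly like the sketch below (the extension number 2 is only what this RFC proposes).

struct drm_i915_gem_create_ext_vm_private priv = {
	.base.name = I915_GEM_CREATE_EXT_VM_PRIVATE,
	.vm_id     = vm_id,
};

struct drm_i915_gem_create_ext create = {
	.size       = 64 * 1024,
	.extensions = (__u64)(uintptr_t)&priv,
};

/* On success, create.handle can only be mapped into vm_id. */
int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);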
+/**
- struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
- A User/Memory fence can be woken up either by:
- the GPU context indicated by @ctx_id, or,
- the kernel driver async worker upon I915_UFENCE_WAIT_SOFT
- (@ctx_id is ignored when this flag is set).
- The wakeup condition is,
- ``((*addr & mask) op (value & mask))``
- See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
- */
+struct drm_i915_gem_wait_user_fence {
- /** @extensions: Zero-terminated chain of extensions. */
- __u64 extensions;
- /** @addr: User/Memory fence address */
- __u64 addr;
- /** @ctx_id: Id of the Context which will signal the fence. */
- __u32 ctx_id;
- /** @op: Wakeup condition operator */
- __u16 op;
+#define I915_UFENCE_WAIT_EQ 0 +#define I915_UFENCE_WAIT_NEQ 1 +#define I915_UFENCE_WAIT_GT 2 +#define I915_UFENCE_WAIT_GTE 3 +#define I915_UFENCE_WAIT_LT 4 +#define I915_UFENCE_WAIT_LTE 5 +#define I915_UFENCE_WAIT_BEFORE 6 +#define I915_UFENCE_WAIT_AFTER 7
- /**
* @flags: Supported flags are,
*
* I915_UFENCE_WAIT_SOFT:
*
* To be woken up by i915 driver async worker (not by GPU).
*
* I915_UFENCE_WAIT_ABSTIME:
*
* Wait timeout specified as absolute time.
*/
- __u16 flags;
+#define I915_UFENCE_WAIT_SOFT 0x1 +#define I915_UFENCE_WAIT_ABSTIME 0x2
- /** @value: Wakeup value */
- __u64 value;
- /** @mask: Wakeup mask */
- __u64 mask;
+#define I915_UFENCE_WAIT_U8 0xffu +#define I915_UFENCE_WAIT_U16 0xffffu +#define I915_UFENCE_WAIT_U32 0xfffffffful +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull
- /**
* @timeout: Wait timeout in nanoseconds.
*
* If the I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is the
* absolute time in nsec.
*/
- __s64 timeout;
+};
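Tying the pieces together, a compute UMD could wait for a user fence signaled by the GPU roughly as below; fence_gpu_va, ctx_id, expected_value and the 1 ms timeout are placeholders, and the wakeup condition follows the ``((*addr & mask) op (value & mask))`` form documented above.

struct drm_i915_gem_wait_user_fence wait = {
	.addr    = fence_gpu_va,	/* qword-aligned fence location */
	.ctx_id  = ctx_id,		/* context expected to signal it */
	.op      = I915_UFENCE_WAIT_GTE,
	.value   = expected_value,
	.mask    = I915_UFENCE_WAIT_U64,
	.timeout = 1000000,		/* 1 ms, relative (ABSTIME not set) */
};

int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);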
On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
Thanks Tvrtko, I think we shouldn't be needing it. As with VM_BIND VA assignment is completely pushed to userspace, no padding should be necessary once the 'start' and 'size' alignment conditions are met.
I will add some documentation on alignment requirement here. Generally, 'start' and 'size' should be 4K aligned. But, I think when we have 64K lmem page sizes (dg2 and xehpsdv), they need to be 64K aligned.
+ Matt
Is aligning to 64k enough for all overfetch issues?
Apparently compute has a situation where a buffer is received by one component and another has to apply more alignment to it, to deal with overfetch. Since they cannot grow the actual BO, would they want to VM_BIND a scratch area on top? Or perhaps none of this is a problem on discrete and the original BO should be correctly allocated to start with.
Side question - what about the align to 2MiB mentioned in i915_vma_insert to avoid mixing 4k and 64k PTEs? Does that not apply to discrete?
Regards,
Tvrtko
On 08/06/2022 08:17, Tvrtko Ursulin wrote:
On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
Not sure about the overfetch thing, but yeah dg2 & xehpsdv both require a minimum of 64K pages underneath for local memory, and the BO size will also be rounded up accordingly. And yeah the complication arises due to not being able to mix 4K + 64K GTT pages within the same page-table (existed since even gen8). Note that 4K here is what we typically get for system memory.
Originally we had a memory coloring scheme to track the "color" of each page-table, which basically ensures that userspace can't do something nasty like mixing page sizes. The advantage of that scheme is that we would only require 64K GTT alignment and no extra padding, but it is perhaps a little complex.
The merged solution is just to align and pad the vma (i.e. vma->node.size and not vma->size) out to 2M, which is dead simple implementation-wise, but does potentially waste some GTT space and some of the local memory used for the actual page-table. For the alignment the kernel just validates that the GTT address is aligned to 2M in vma_insert(), and for the padding it just inflates it to 2M, if userspace hasn't already.
See the kernel-doc for @size: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_cr...
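To make the consequence for VM_BIND concrete, a userspace VA allocator on these 64K lmem page platforms could simply round its lmem binds to 2M itself; the helpers below are only illustrative (names made up for the example).

#define SZ_2M	(2ull << 20)

static __u64 align_up(__u64 x, __u64 a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Keep 64K (lmem) and 4K (smem) PTEs out of the same page table by
 * handing out lmem VA in whole 2M chunks. */
__u64 start  = align_up(va_hint, SZ_2M);
__u64 length = align_up(bo_size, SZ_2M);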
On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
On 08/06/2022 08:17, Tvrtko Ursulin wrote:
On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
Ok, those requirements (2M VA alignment) will apply to VM_BIND also. This is unfortunate, but it is not something newly enforced by VM_BIND. The other option is to go with 64K alignment, and in the VM_BIND case the user must ensure there is no mixing of 64K (lmem) and 4K (smem) mappings in the same 2M range. But this is not VM_BIND specific (it will apply to soft-pinning in execbuf2 also).
I don't think we need any VA padding here, as with VM_BIND the VA is managed fully by the user. If we enforce the VA to be 2M aligned, it will leave holes (if BOs are smaller than 2M), but nobody is going to allocate anything from there.
Niranjana
On 08/06/2022 22:32, Niranjana Vishwanathapura wrote:
On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
On 08/06/2022 08:17, Tvrtko Ursulin wrote:
On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura
niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++ 1 file changed, 399 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..589c0a009107 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h @@ -0,0 +1,399 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+/**
+ * DOC: I915_PARAM_HAS_VM_BIND
+ *
+ * VM_BIND feature availability.
+ * See typedef drm_i915_getparam_t param.
+ */
+#define I915_PARAM_HAS_VM_BIND 57
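For illustration only (not part of the proposed header): userspace feature discovery would presumably boil down to something like the sketch below. It assumes libdrm's drmIoctl(), that the new defines end up in i915_drm.h, and a made-up helper name; the same headers are assumed by the other sketches further down.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <xf86drm.h>        /* drmIoctl() from libdrm */
#include <drm/i915_drm.h>   /* existing uapi plus, assumed, the defines above */

/* Hypothetical helper: query VM_BIND availability via GETPARAM. */
static bool has_vm_bind(int fd)
{
	int value = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_VM_BIND,
		.value = &value,
	};

	return drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) == 0 && value != 0;
}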
+/**
+ * DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
+ *
+ * Flag to opt-in for VM_BIND mode of binding during VM creation.
+ * See struct drm_i915_gem_vm_control flags.
+ *
+ * A VM in VM_BIND mode will not support the older execbuff mode of binding.
+ * In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
+ * &drm_i915_gem_execbuffer2.buffer_count must be 0).
+ * Also, &drm_i915_gem_execbuffer2.batch_start_offset and
+ * &drm_i915_gem_execbuffer2.batch_len must be 0.
+ * DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
+ * to pass in the batch buffer addresses.
+ *
+ * Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
+ * I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
+ * (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
+ * set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
+ * The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
+ * of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
+ */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0)
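As an aside (a sketch, not part of the patch): opting a VM in to VM_BIND mode would then look roughly like this, using the existing VM create ioctl and struct drm_i915_gem_vm_control; the helper name is invented.

/* Hypothetical helper: create an address space that opts in to VM_BIND. */
static int create_vm_bind_vm(int fd, __u32 *vm_id)
{
	struct drm_i915_gem_vm_control ctl = {
		.flags = I915_VM_CREATE_FLAGS_USE_VM_BIND,
	};
	int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_CREATE, &ctl);

	if (ret == 0)
		*vm_id = ctl.vm_id;
	return ret;
}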
+/**
+ * DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
+ *
+ * Flag to declare context as long running.
+ * See struct drm_i915_gem_context_create_ext flags.
+ *
+ * Usage of dma-fence expects that they complete in reasonable amount of time.
+ * Compute on the other hand can be long running. Hence it is not appropriate
+ * for compute contexts to export request completion dma-fence to user.
+ * The dma-fence usage will be limited to in-kernel consumption only.
+ * Compute contexts need to use user/memory fence.
+ *
+ * So, long running contexts do not support output fences. Hence,
+ * I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
+ * I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
+ * to be not used.
+ *
+ * DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
+ * to long running contexts.
+ */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
+/* VM_BIND related ioctls */ +#define DRM_I915_GEM_VM_BIND 0x3d +#define DRM_I915_GEM_VM_UNBIND 0x3e +#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f
+#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+/**
+ * struct drm_i915_gem_vm_bind - VA to object mapping to bind.
+ *
+ * This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
+ * virtual address (VA) range to the section of an object that should be bound
+ * in the device page table of the specified address space (VM).
+ * The VA range specified must be unique (ie., not currently bound) and can
+ * be mapped to whole object or a section of the object (partial binding).
+ * Multiple VA mappings can be created to the same section of the object
+ * (aliasing).
+ */
+struct drm_i915_gem_vm_bind { + /** @vm_id: VM (address space) id to bind */ + __u32 vm_id;
+ /** @handle: Object handle */ + __u32 handle;
+ /** @start: Virtual Address start to bind */ + __u64 start;
+ /** @offset: Offset in object to bind */ + __u64 offset;
+ /** @length: Length of mapping to bind */ + __u64 length;
Does it support, or should it, equivalent of EXEC_OBJECT_PAD_TO_SIZE? Or if not userspace is expected to map the remainder of the space to a dummy object? In which case would there be any alignment/padding issues preventing the two bind to be placed next to each other?
I ask because someone from the compute side asked me about a problem with their strategy of dealing with overfetch and I suggested pad to size.
Thanks Tvrtko, I think we shouldn't be needing it. As with VM_BIND VA assignment is completely pushed to userspace, no padding should be necessary once the 'start' and 'size' alignment conditions are met.
I will add some documentation on alignment requirement here. Generally, 'start' and 'size' should be 4K aligned. But, I think when we have 64K lmem page sizes (dg2 and xehpsdv), they need to be 64K aligned.
- Matt
Align to 64k is enough for all overfetch issues?
Apparently compute has a situation where a buffer is received by one component and another has to apply more alignment to it, to deal with overfetch. Since they cannot grow the actual BO if they wanted to VM_BIND a scratch area on top? Or perhaps none of this is a problem on discrete and original BO should be correctly allocated to start with.
Side question - what about the align to 2MiB mentioned in i915_vma_insert to avoid mixing 4k and 64k PTEs? That does not apply to discrete?
Not sure about the overfetch thing, but yeah dg2 & xehpsdv both require a minimum of 64K pages underneath for local memory, and the BO size will also be rounded up accordingly. And yeah the complication arises due to not being able to mix 4K + 64K GTT pages within the same page-table (existed since even gen8). Note that 4K here is what we typically get for system memory.
Originally we had a memory coloring scheme to track the "color" of each page-table, which basically ensures that userspace can't do something nasty like mixing page sizes. The advantage of that scheme is that we would only require 64K GTT alignment and no extra padding, but is perhaps a little complex.
The merged solution is just to align and pad (i.e vma->node.size and not vma->size) out of the vma to 2M, which is dead simple implementation wise, but does potentially waste some GTT space and some of the local memory used for the actual page-table. For the alignment the kernel just validates that the GTT address is aligned to 2M in vma_insert(), and then for the padding it just inflates it to 2M, if userspace hasn't already.
See the kernel-doc for @size: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_cr...
Ok, those requirements (2M VA alignment) will apply to VM_BIND also. This is unfortunate, but it is not something new enforced by VM_BIND. Other option is to go with 64K alignment and in VM_BIND case, user must ensure there is no mix-matching of 64K (lmem) and 4k (smem) mappings in the same 2M range. But this is not VM_BIND specific (will apply to soft-pinning in execbuf2 also).
I don't think we need any VA padding here as, with VM_BIND, VA is managed fully by the user. If we enforce VA to be 2M aligned, it will leave holes (if BOs are smaller than 2M), but nobody is going to allocate anything from there.
Note that we only apply the 2M alignment + padding for local memory pages, for system memory we don't have/need such restrictions. The VA padding then importantly prevents userspace from incorrectly (or maliciously) inserting 4K system memory object in some page-table operating in 64K GTT mode.
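To make that rule concrete (a sketch only; the 2M figure is the LMEM alignment discussed above, and the helper name is made up), a userspace VA allocator for LMEM-backed binds would effectively do:

/* Hypothetical helper: round a VA or length up to the LMEM bind granularity. */
#define LMEM_BIND_ALIGNMENT	(2ull << 20)	/* 2M, per the discussion above */

static __u64 lmem_bind_align(__u64 x)
{
	return (x + LMEM_BIND_ALIGNMENT - 1) & ~(LMEM_BIND_ALIGNMENT - 1);
}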
Niranjana
Regards,
Tvrtko
Niranjana
Regards,
Tvrtko
+ /** + * @flags: Supported flags are, + * + * I915_GEM_VM_BIND_READONLY: + * Mapping is read-only. + * + * I915_GEM_VM_BIND_CAPTURE: + * Capture this mapping in the dump upon GPU error. + */ + __u64 flags; +#define I915_GEM_VM_BIND_READONLY (1 << 0) +#define I915_GEM_VM_BIND_CAPTURE (1 << 1)
+ /** @extensions: 0-terminated chain of extensions for this mapping. */ + __u64 extensions; +};
+/**
+ * struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
+ *
+ * This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
+ * address (VA) range that should be unbound from the device page table of the
+ * specified address space (VM). The specified VA range must match one of the
+ * mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
+ * completion.
+ */
+struct drm_i915_gem_vm_unbind { + /** @vm_id: VM (address space) id to bind */ + __u32 vm_id;
+ /** @rsvd: Reserved for future use; must be zero. */ + __u32 rsvd;
+ /** @start: Virtual Address start to unbind */ + __u64 start;
+ /** @length: Length of mapping to unbind */ + __u64 length;
+ /** @flags: reserved for future usage, currently MBZ */ + __u64 flags;
+ /** @extensions: 0-terminated chain of extensions for this mapping. */ + __u64 extensions; +};
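For illustration (not part of the proposed header): with the two structs above, binding a whole BO at a user-chosen VA and later unbinding it would look roughly like the following sketch; the helper names and the choice of flags are assumptions.

/* Hypothetical helpers: bind a whole BO at a user-chosen VA, later unbind it. */
static int vm_bind_bo(int fd, __u32 vm_id, __u32 handle, __u64 va, __u64 size)
{
	struct drm_i915_gem_vm_bind bind = {
		.vm_id = vm_id,
		.handle = handle,
		.start = va,
		.offset = 0,				/* whole object */
		.length = size,
		.flags = I915_GEM_VM_BIND_CAPTURE,	/* include in error dumps */
	};

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_BIND, &bind);
}

static int vm_unbind_bo(int fd, __u32 vm_id, __u64 va, __u64 size)
{
	struct drm_i915_gem_vm_unbind unbind = {
		.vm_id = vm_id,
		.start = va,
		.length = size,
	};

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_VM_UNBIND, &unbind);
}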
+/**
+ * struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
+ * or the vm_unbind work.
+ *
+ * The vm_bind or vm_unbind async worker will wait for input fence to signal
+ * before starting the binding or unbinding.
+ *
+ * The vm_bind or vm_unbind async worker will signal the returned output fence
+ * after the completion of binding or unbinding.
+ */
+struct drm_i915_vm_bind_fence { + /** @handle: User's handle for a drm_syncobj to wait on or signal. */ + __u32 handle;
+ /** + * @flags: Supported flags are, + * + * I915_VM_BIND_FENCE_WAIT: + * Wait for the input fence before binding/unbinding + * + * I915_VM_BIND_FENCE_SIGNAL: + * Return bind/unbind completion fence as output + */ + __u32 flags; +#define I915_VM_BIND_FENCE_WAIT (1<<0) +#define I915_VM_BIND_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1)) +};
+/**
+ * struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
+ * and vm_unbind.
+ *
+ * This structure describes an array of timeline drm_syncobj and associated
+ * points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
+ * can be input or output fences (See struct drm_i915_vm_bind_fence).
+ */
+struct drm_i915_vm_bind_ext_timeline_fences { +#define I915_VM_BIND_EXT_timeline_FENCES 0 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base;
+ /** + * @fence_count: Number of elements in the @handles_ptr & @value_ptr + * arrays. + */ + __u64 fence_count;
+ /** + * @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence + * of length @fence_count. + */ + __u64 handles_ptr;
+ /** + * @values_ptr: Pointer to an array of u64 values of length + * @fence_count. + * Values must be 0 for a binary drm_syncobj. A Value of 0 for a + * timeline drm_syncobj is invalid as it turns a drm_syncobj into a + * binary one. + */ + __u64 values_ptr; +};
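A rough sketch of how a non-compute client might ask for a bind-completion fence on a timeline syncobj with this extension (illustrative only; the helper is hypothetical and the extension name is spelled exactly as defined above):

/* Hypothetical helper: ask for a bind-completion fence on a timeline syncobj. */
static void bind_add_out_fence(struct drm_i915_gem_vm_bind *bind,
			       struct drm_i915_vm_bind_ext_timeline_fences *ext,
			       struct drm_i915_vm_bind_fence *fence,
			       __u64 *value, __u32 syncobj, __u64 point)
{
	fence->handle = syncobj;
	fence->flags = I915_VM_BIND_FENCE_SIGNAL;
	*value = point;			/* 0 would mean a binary syncobj */

	memset(ext, 0, sizeof(*ext));
	ext->base.name = I915_VM_BIND_EXT_timeline_FENCES;
	ext->fence_count = 1;
	ext->handles_ptr = (__u64)(uintptr_t)fence;
	ext->values_ptr = (__u64)(uintptr_t)value;

	bind->extensions = (__u64)(uintptr_t)ext;
}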
+/**
+ * struct drm_i915_vm_bind_user_fence - An input or output user fence for the
+ * vm_bind or the vm_unbind work.
+ *
+ * The vm_bind or vm_unbind async worker will wait for the input fence (value at
+ * @addr to become equal to @val) before starting the binding or unbinding.
+ *
+ * The vm_bind or vm_unbind async worker will signal the output fence after
+ * the completion of binding or unbinding by writing @val to memory location at
+ * @addr.
+ */
+struct drm_i915_vm_bind_user_fence { + /** @addr: User/Memory fence qword aligned process virtual address */ + __u64 addr;
+ /** @val: User/Memory fence value to be written after bind completion */ + __u64 val;
+ /** + * @flags: Supported flags are, + * + * I915_VM_BIND_USER_FENCE_WAIT: + * Wait for the input fence before binding/unbinding + * + * I915_VM_BIND_USER_FENCE_SIGNAL: + * Return bind/unbind completion fence as output + */ + __u32 flags; +#define I915_VM_BIND_USER_FENCE_WAIT (1<<0) +#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1) +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \ + (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1)) +};
+/**
+ * struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
+ * and vm_unbind.
+ *
+ * These user fences can be input or output fences
+ * (See struct drm_i915_vm_bind_user_fence).
+ */
+struct drm_i915_vm_bind_ext_user_fence { +#define I915_VM_BIND_EXT_USER_FENCES 1 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base;
+ /** @fence_count: Number of elements in the @user_fence_ptr array. */ + __u64 fence_count;
+ /** + * @user_fence_ptr: Pointer to an array of + * struct drm_i915_vm_bind_user_fence of length @fence_count. + */ + __u64 user_fence_ptr; +};
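The compute-context equivalent of the timeline sketch above could look something like this (again illustrative only; the helper name and the single-fence setup are assumptions):

/* Hypothetical helper: request a user/memory fence write on bind completion. */
static void bind_add_user_fence(struct drm_i915_gem_vm_bind *bind,
				struct drm_i915_vm_bind_ext_user_fence *ext,
				struct drm_i915_vm_bind_user_fence *uf,
				__u64 fence_cpu_addr, __u64 value)
{
	uf->addr = fence_cpu_addr;	/* qword aligned process virtual address */
	uf->val = value;
	uf->flags = I915_VM_BIND_USER_FENCE_SIGNAL;

	memset(ext, 0, sizeof(*ext));
	ext->base.name = I915_VM_BIND_EXT_USER_FENCES;
	ext->fence_count = 1;
	ext->user_fence_ptr = (__u64)(uintptr_t)uf;

	bind->extensions = (__u64)(uintptr_t)ext;
}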
+/**
+ * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
+ * gpu virtual addresses.
+ *
+ * In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
+ * must always be appended in the VM_BIND mode and it will be an error to
+ * append this extension in older non-VM_BIND mode.
+ */
+struct drm_i915_gem_execbuffer_ext_batch_addresses { +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base;
+ /** @count: Number of addresses in the addr array. */ + __u32 count;
+ /** @addr: An array of batch gpu virtual addresses. */ + __u64 addr[0]; +};
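To illustrate the execbuf side (a sketch, not part of the patch): with no execlist, the only inputs are the batch VA via this extension and the context; the extension chain is passed the way I915_EXEC_USE_EXTENSIONS already does it in the existing execbuffer2 uapi, i.e. through cliprects_ptr. Helper name and layout are assumptions.

/* Hypothetical helper: submit one batch by VA on a VM_BIND context. */
static int execbuf_vm_bind_mode(int fd, __u32 ctx_id, __u64 batch_va)
{
	struct {
		struct drm_i915_gem_execbuffer_ext_batch_addresses ext;
		__u64 addrs[1];		/* storage behind ext.addr[] */
	} batch;
	struct drm_i915_gem_execbuffer2 execbuf;

	memset(&batch, 0, sizeof(batch));
	batch.ext.base.name = DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES;
	batch.ext.count = 1;
	batch.addrs[0] = batch_va;

	memset(&execbuf, 0, sizeof(execbuf));
	/* No execlist: buffers_ptr, buffer_count, batch_start_offset and
	 * batch_len all stay 0 in VM_BIND mode. */
	execbuf.flags = I915_EXEC_USE_EXTENSIONS;
	execbuf.cliprects_ptr = (__u64)(uintptr_t)&batch;	/* extension chain */
	execbuf.rsvd1 = ctx_id;					/* context id */

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}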
+/**
+ * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
+ * signaling extension.
+ *
+ * This extension allows user to attach a user fence (@addr, @value pair) to an
+ * execbuf to be signaled by the command streamer after the completion of first
+ * level batch, by writing the @value at specified @addr and triggering an
+ * interrupt.
+ * User can either poll for this user fence to signal or can also wait on it
+ * with i915_gem_wait_user_fence ioctl.
+ * This is very much useful for long running contexts where waiting on dma-fence
+ * by user (like i915_gem_wait ioctl) is not supported.
+ */
+struct drm_i915_gem_execbuffer_ext_user_fence { +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base;
+ /** + * @addr: User/Memory fence qword aligned GPU virtual address. + * + * Address has to be a valid GPU virtual address at the time of + * first level batch completion. + */ + __u64 addr;
+ /** + * @value: User/Memory fence Value to be written to above address + * after first level batch completes. + */ + __u64 value;
+ /** @rsvd: Reserved for future extensions, MBZ */ + __u64 rsvd; +};
+/**
+ * struct drm_i915_gem_create_ext_vm_private - Extension to make the object
+ * private to the specified VM.
+ *
+ * See struct drm_i915_gem_create_ext.
+ */
+struct drm_i915_gem_create_ext_vm_private { +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2 + /** @base: Extension link. See struct i915_user_extension. */ + struct i915_user_extension base;
+ /** @vm_id: Id of the VM to which the object is private */ + __u32 vm_id; +};
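A minimal sketch of using this with the existing GEM_CREATE_EXT ioctl (illustrative only; helper name is made up):

/* Hypothetical helper: create a BO that is private to a single VM. */
static int create_vm_private_bo(int fd, __u32 vm_id, __u64 size, __u32 *handle)
{
	struct drm_i915_gem_create_ext_vm_private priv = {
		.base.name = I915_GEM_CREATE_EXT_VM_PRIVATE,
		.vm_id = vm_id,
	};
	struct drm_i915_gem_create_ext create = {
		.size = size,
		.extensions = (__u64)(uintptr_t)&priv,
	};
	int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_CREATE_EXT, &create);

	if (ret == 0)
		*handle = create.handle;
	return ret;
}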
+/**
+ * struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
+ *
+ * User/Memory fence can be woken up either by:
+ *
+ * 1. GPU context indicated by @ctx_id, or,
+ * 2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
+ *    @ctx_id is ignored when this flag is set.
+ *
+ * Wakeup condition is,
+ * ``((*addr & mask) op (value & mask))``
+ *
+ * See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
+ */
+struct drm_i915_gem_wait_user_fence { + /** @extensions: Zero-terminated chain of extensions. */ + __u64 extensions;
+ /** @addr: User/Memory fence address */ + __u64 addr;
+ /** @ctx_id: Id of the Context which will signal the fence. */ + __u32 ctx_id;
+ /** @op: Wakeup condition operator */ + __u16 op; +#define I915_UFENCE_WAIT_EQ 0 +#define I915_UFENCE_WAIT_NEQ 1 +#define I915_UFENCE_WAIT_GT 2 +#define I915_UFENCE_WAIT_GTE 3 +#define I915_UFENCE_WAIT_LT 4 +#define I915_UFENCE_WAIT_LTE 5 +#define I915_UFENCE_WAIT_BEFORE 6 +#define I915_UFENCE_WAIT_AFTER 7
+ /** + * @flags: Supported flags are, + * + * I915_UFENCE_WAIT_SOFT: + * + * To be woken up by i915 driver async worker (not by GPU). + * + * I915_UFENCE_WAIT_ABSTIME: + * + * Wait timeout specified as absolute time. + */ + __u16 flags; +#define I915_UFENCE_WAIT_SOFT 0x1 +#define I915_UFENCE_WAIT_ABSTIME 0x2
+ /** @value: Wakeup value */ + __u64 value;
+ /** @mask: Wakeup mask */ + __u64 mask; +#define I915_UFENCE_WAIT_U8 0xffu +#define I915_UFENCE_WAIT_U16 0xffffu +#define I915_UFENCE_WAIT_U32 0xfffffffful +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull
+ /** + * @timeout: Wait timeout in nanoseconds. + * + * If I915_UFENCE_WAIT_ABSTIME flag is set, then time timeout is the + * absolute time in nsec. + */ + __s64 timeout; +};
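To close the loop on the compute flow (a sketch, not part of the patch): after submitting with the user fence execbuf extension above, a client could either poll the fence location or block in the kernel roughly like this; the helper name and the choice of a 64-bit GTE wait are assumptions.

/* Hypothetical helper: block until the 64-bit value at fence_addr >= value. */
static int wait_user_fence_gte(int fd, __u32 ctx_id, __u64 fence_addr,
			       __u64 value, __s64 timeout_ns)
{
	struct drm_i915_gem_wait_user_fence wait = {
		.addr = fence_addr,
		.ctx_id = ctx_id,
		.op = I915_UFENCE_WAIT_GTE,
		.value = value,
		.mask = I915_UFENCE_WAIT_U64,
		.timeout = timeout_ns,	/* relative; I915_UFENCE_WAIT_ABSTIME not set */
	};

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_WAIT_USER_FENCE, &wait);
}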
On Thu, Jun 09, 2022 at 09:36:48AM +0100, Matthew Auld wrote:
On 08/06/2022 22:32, Niranjana Vishwanathapura wrote:
On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
On 08/06/2022 08:17, Tvrtko Ursulin wrote:
On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
[snip]
Does it support, or should it, equivalent of EXEC_OBJECT_PAD_TO_SIZE? Or if not userspace is expected to map the remainder of the space to a dummy object? In which case would there be any alignment/padding issues preventing the two bind to be placed next to each other?
I ask because someone from the compute side asked me about a problem with their strategy of dealing with overfetch and I suggested pad to size.
Thanks Tvrtko, I think we shouldn't be needing it. As with VM_BIND VA assignment is completely pushed to userspace, no padding should be necessary once the 'start' and 'size' alignment conditions are met.
I will add some documentation on alignment requirement here. Generally, 'start' and 'size' should be 4K aligned. But, I think when we have 64K lmem page sizes (dg2 and xehpsdv), they need to be 64K aligned.
- Matt
Align to 64k is enough for all overfetch issues?
Apparently compute has a situation where a buffer is received by one component and another has to apply more alignment to it, to deal with overfetch. Since they cannot grow the actual BO if they wanted to VM_BIND a scratch area on top? Or perhaps none of this is a problem on discrete and original BO should be correctly allocated to start with.
Side question - what about the align to 2MiB mentioned in i915_vma_insert to avoid mixing 4k and 64k PTEs? That does not apply to discrete?
Not sure about the overfetch thing, but yeah dg2 & xehpsdv both require a minimum of 64K pages underneath for local memory, and the BO size will also be rounded up accordingly. And yeah the complication arises due to not being able to mix 4K + 64K GTT pages within the same page-table (existed since even gen8). Note that 4K here is what we typically get for system memory.
Originally we had a memory coloring scheme to track the "color" of each page-table, which basically ensures that userspace can't do something nasty like mixing page sizes. The advantage of that scheme is that we would only require 64K GTT alignment and no extra padding, but is perhaps a little complex.
The merged solution is just to align and pad (i.e vma->node.size and not vma->size) out of the vma to 2M, which is dead simple implementation wise, but does potentially waste some GTT space and some of the local memory used for the actual page-table. For the alignment the kernel just validates that the GTT address is aligned to 2M in vma_insert(), and then for the padding it just inflates it to 2M, if userspace hasn't already.
See the kernel-doc for @size: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_cr...
Ok, those requirements (2M VA alignment) will apply to VM_BIND also. This is unfortunate, but it is not something new enforced by VM_BIND. Other option is to go with 64K alignment and in VM_BIND case, user must ensure there is no mix-matching of 64K (lmem) and 4k (smem) mappings in the same 2M range. But this is not VM_BIND specific (will apply to soft-pinning in execbuf2 also).
I don't think we need any VA padding here as, with VM_BIND, VA is managed fully by the user. If we enforce VA to be 2M aligned, it will leave holes (if BOs are smaller than 2M), but nobody is going to allocate anything from there.
Note that we only apply the 2M alignment + padding for local memory pages, for system memory we don't have/need such restrictions. The VA padding then importantly prevents userspace from incorrectly (or maliciously) inserting 4K system memory object in some page-table operating in 64K GTT mode.
Thanks Matt. I also synced offline with Matt a bit on this. We don't need an explicit 'pad_to_size'. The i915 driver is implicitly padding the size to a 2M boundary for LMEM BOs, which will apply to VM_BIND also. The remaining question is whether we enforce 2M VA alignment for lmem BOs (just like the legacy execbuff path) on dg2 & xehpsdv, or go with just 64K alignment but ensure there is no mixing of 4K and 64K mappings in the same 2M range. I think we can go with the 2M alignment requirement for VM_BIND also. So, no new requirements here for VM_BIND.
I will update the documentation.
Niranjana
Niranjana
Regards,
Tvrtko
Niranjana
Regards,
Tvrtko
[snip]
On 09/06/2022 19:53, Niranjana Vishwanathapura wrote:
On Thu, Jun 09, 2022 at 09:36:48AM +0100, Matthew Auld wrote:
On 08/06/2022 22:32, Niranjana Vishwanathapura wrote:
On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
On 08/06/2022 08:17, Tvrtko Ursulin wrote:
On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
[snip]
Does it support, or should it, equivalent of EXEC_OBJECT_PAD_TO_SIZE? Or if not userspace is expected to map the remainder of the space to a dummy object? In which case would there be any alignment/padding issues preventing the two bind to be placed next to each other?
I ask because someone from the compute side asked me about a problem with their strategy of dealing with overfetch and I suggested pad to size.
Thanks Tvrtko, I think we shouldn't be needing it. As with VM_BIND VA assignment is completely pushed to userspace, no padding should be necessary once the 'start' and 'size' alignment conditions are met.
I will add some documentation on alignment requirement here. Generally, 'start' and 'size' should be 4K aligned. But, I think when we have 64K lmem page sizes (dg2 and xehpsdv), they need to be 64K aligned.
- Matt
Align to 64k is enough for all overfetch issues?
Apparently compute has a situation where a buffer is received by one component and another has to apply more alignment to it, to deal with overfetch. Since they cannot grow the actual BO if they wanted to VM_BIND a scratch area on top? Or perhaps none of this is a problem on discrete and original BO should be correctly allocated to start with.
Side question - what about the align to 2MiB mentioned in i915_vma_insert to avoid mixing 4k and 64k PTEs? That does not apply to discrete?
Not sure about the overfetch thing, but yeah dg2 & xehpsdv both require a minimum of 64K pages underneath for local memory, and the BO size will also be rounded up accordingly. And yeah the complication arises due to not being able to mix 4K + 64K GTT pages within the same page-table (existed since even gen8). Note that 4K here is what we typically get for system memory.
Originally we had a memory coloring scheme to track the "color" of each page-table, which basically ensures that userspace can't do something nasty like mixing page sizes. The advantage of that scheme is that we would only require 64K GTT alignment and no extra padding, but is perhaps a little complex.
The merged solution is just to align and pad (i.e vma->node.size and not vma->size) out of the vma to 2M, which is dead simple implementation wise, but does potentially waste some GTT space and some of the local memory used for the actual page-table. For the alignment the kernel just validates that the GTT address is aligned to 2M in vma_insert(), and then for the padding it just inflates it to 2M, if userspace hasn't already.
See the kernel-doc for @size: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_cr...
Ok, those requirements (2M VA alignment) will apply to VM_BIND also. This is unfortunate, but it is not something new enforced by VM_BIND. Other option is to go with 64K alignment and in VM_BIND case, user must ensure there is no mix-matching of 64K (lmem) and 4k (smem) mappings in the same 2M range. But this is not VM_BIND specific (will apply to soft-pinning in execbuf2 also).
I don't think we need any VA padding here as, with VM_BIND, VA is managed fully by the user. If we enforce VA to be 2M aligned, it will leave holes (if BOs are smaller than 2M), but nobody is going to allocate anything from there.
Note that we only apply the 2M alignment + padding for local memory pages, for system memory we don't have/need such restrictions. The VA padding then importantly prevents userspace from incorrectly (or maliciously) inserting 4K system memory object in some page-table operating in 64K GTT mode.
Thanks Matt. I also synced offline with Matt a bit on this. We don't need an explicit 'pad_to_size'. The i915 driver is implicitly padding the size to a 2M boundary for LMEM BOs, which will apply to VM_BIND also. The remaining question is whether we enforce 2M VA alignment for lmem BOs (just like the legacy execbuff path) on dg2 & xehpsdv, or go with just 64K alignment but ensure there is no mixing of 4K and 64K
"Driver is implicitly padding the size to 2MB boundary" - this is the backing store?
mappings in the same 2M range. I think we can go with the 2M alignment requirement for VM_BIND also. So, no new requirements here for VM_BIND.
Are there any considerations here of letting userspace know? Presumably the userspace allocator has to know, or it would try to ask for impossible addresses.
Regards,
Tvrtko
I will update the documentation.
Niranjana
Niranjana
Regards,
Tvrtko
Niranjana
[snip]
On 10/06/2022 11:16, Tvrtko Ursulin wrote:
On 09/06/2022 19:53, Niranjana Vishwanathapura wrote:
On Thu, Jun 09, 2022 at 09:36:48AM +0100, Matthew Auld wrote:
On 08/06/2022 22:32, Niranjana Vishwanathapura wrote:
On Wed, Jun 08, 2022 at 10:12:05AM +0100, Matthew Auld wrote:
On 08/06/2022 08:17, Tvrtko Ursulin wrote:
On 07/06/2022 20:37, Niranjana Vishwanathapura wrote:
On Tue, Jun 07, 2022 at 11:27:14AM +0100, Tvrtko Ursulin wrote:
On 17/05/2022 19:32, Niranjana Vishwanathapura wrote:
[snip]
Thanks Tvrtko, I think we shouldn't be needing it. As with VM_BIND VA assignment is completely pushed to userspace, no padding should be necessary once the 'start' and 'size' alignment conditions are met.
I will add some documentation on alignment requirement here. Generally, 'start' and 'size' should be 4K aligned. But, I think when we have 64K lmem page sizes (dg2 and xehpsdv), they need to be 64K aligned.
- Matt
Is aligning to 64K enough for all overfetch issues?
Apparently compute has a situation where a buffer is received by one component and another has to apply extra alignment to it to deal with overfetch. Since they cannot grow the actual BO, would they VM_BIND a scratch area on top instead? Or perhaps none of this is a problem on discrete and the original BO should be correctly allocated to start with.
Side question - what about the alignment to 2MiB mentioned in i915_vma_insert to avoid mixing 4K and 64K PTEs? Does that not apply to discrete?
Not sure about the overfetch thing, but yeah dg2 & xehpsdv both require a minimum of 64K pages underneath for local memory, and the BO size will also be rounded up accordingly. And yeah the complication arises from not being able to mix 4K + 64K GTT pages within the same page-table (a restriction that has existed since gen8). Note that 4K here is what we typically get for system memory.
Originally we had a memory coloring scheme to track the "color" of each page-table, which basically ensures that userspace can't do something nasty like mixing page sizes. The advantage of that scheme is that we would only require 64K GTT alignment and no extra padding, but it is perhaps a little complex.
The merged solution is just to align and pad the vma out to 2M (i.e. vma->node.size and not vma->size), which is dead simple implementation-wise, but does potentially waste some GTT space and some of the local memory used for the actual page-table. For the alignment the kernel just validates that the GTT address is aligned to 2M in vma_insert(), and for the padding it just inflates the node to 2M if userspace hasn't already.
See the kernel-doc for @size: https://dri.freedesktop.org/docs/drm/gpu/driver-uapi.html?#c.drm_i915_gem_cr...
Ok, those requirements (2M VA alignment) will apply to VM_BIND also. This is unfortunate, but it is not something newly enforced by VM_BIND. The other option is to go with 64K alignment, in which case the user must ensure there is no mixing of 64K (lmem) and 4K (smem) mappings in the same 2M range. But this is not VM_BIND specific (it will apply to soft-pinning in execbuf2 also).
I don't think we need any VA padding here, as with VM_BIND the VA is managed fully by the user. If we enforce the VA to be 2M aligned, it will leave holes (if BOs are smaller than 2M), but nobody is going to allocate anything from there.
Note that we only apply the 2M alignment + padding for local memory pages, for system memory we don't have/need such restrictions. The VA padding then importantly prevents userspace from incorrectly (or maliciously) inserting 4K system memory object in some page-table operating in 64K GTT mode.
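To make the constraint concrete, here is a minimal userspace-side sketch of what a VA allocator would have to do in VM_BIND mode. This is purely illustrative; the helper names and constants below are made up for the example and are not part of the proposed uapi:

	#include <stdbool.h>
	#include <stdint.h>

	#define SZ_4K	0x1000ull
	#define SZ_64K	0x10000ull
	#define SZ_2M	0x200000ull

	static uint64_t align_up(uint64_t x, uint64_t a)
	{
		return (x + a - 1) & ~(a - 1);	/* a must be a power of two */
	}

	/*
	 * Local-memory (64K page) and system-memory (4K page) PTEs must not
	 * share a 2M page-table on dg2/xehpsdv, so lmem binds get a 2M-aligned
	 * VA and a 2M-padded reservation; smem binds only need 4K.
	 */
	static uint64_t reserve_va(uint64_t hint, uint64_t bo_size, bool is_lmem,
				   uint64_t *reserved_size)
	{
		uint64_t align = is_lmem ? SZ_2M : SZ_4K;
		uint64_t start = align_up(hint, align);

		*reserved_size = align_up(bo_size, align);
		return start;	/* caller binds at [start, start + *reserved_size) */
	}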
Thanks Matt. I also synced offline with Matt a bit on this. We don't need an explicit 'pad_to_size'. The i915 driver implicitly pads the size to a 2M boundary for LMEM BOs, which will apply to VM_BIND also. The remaining question is whether we enforce 2M VA alignment for lmem BOs (just like the legacy execbuff path) on dg2 & xehpsdv, or go with just 64K alignment but ensure there is no mixing of 4K and 64K mappings in the same 2M range.
"Driver is implicitly padding the size to 2MB boundary" - this is the backing store?
Just the GTT space, i.e vma->node.size. Backing store just needs to use 64K pages.
I think we can go with the 2M alignment requirement for VM_BIND also. So, no new requirements here for VM_BIND.
Are there any considerations here of letting userspace know? Presumably the userspace allocator has to know, or it would try to ask for impossible addresses.
It's the existing behaviour with execbuf, so I assume userspace must already get this right, on platforms like dg2.
Regards,
Tvrtko
I will update the documentation.
Niranjana
Niranjana
Regards,
Tvrtko
> > Niranjana > >> Regards, >> >> Tvrtko >> >>> + >>> + /** >>> + * @flags: Supported flags are, >>> + * >>> + * I915_GEM_VM_BIND_READONLY: >>> + * Mapping is read-only. >>> + * >>> + * I915_GEM_VM_BIND_CAPTURE: >>> + * Capture this mapping in the dump upon GPU error. >>> + */ >>> + __u64 flags; >>> +#define I915_GEM_VM_BIND_READONLY (1 << 0) >>> +#define I915_GEM_VM_BIND_CAPTURE (1 << 1) >>> + >>> + /** @extensions: 0-terminated chain of extensions for this >>> mapping. */ >>> + __u64 extensions; >>> +}; >>> + >>> +/** >>> + * struct drm_i915_gem_vm_unbind - VA to object mapping to >>> unbind. >>> + * >>> + * This structure is passed to VM_UNBIND ioctl and specifies >>> the GPU virtual >>> + * address (VA) range that should be unbound from the device >>> page table of the >>> + * specified address space (VM). The specified VA range must >>> match one of the >>> + * mappings created with the VM_BIND ioctl. TLB is flushed >>> upon unbind >>> + * completion. >>> + */ >>> +struct drm_i915_gem_vm_unbind { >>> + /** @vm_id: VM (address space) id to bind */ >>> + __u32 vm_id; >>> + >>> + /** @rsvd: Reserved for future use; must be zero. */ >>> + __u32 rsvd; >>> + >>> + /** @start: Virtual Address start to unbind */ >>> + __u64 start; >>> + >>> + /** @length: Length of mapping to unbind */ >>> + __u64 length; >>> + >>> + /** @flags: reserved for future usage, currently MBZ */ >>> + __u64 flags; >>> + >>> + /** @extensions: 0-terminated chain of extensions for this >>> mapping. */ >>> + __u64 extensions; >>> +}; >>> + >>> +/** >>> + * struct drm_i915_vm_bind_fence - An input or output fence >>> for the vm_bind >>> + * or the vm_unbind work. >>> + * >>> + * The vm_bind or vm_unbind aync worker will wait for input >>> fence to signal >>> + * before starting the binding or unbinding. >>> + * >>> + * The vm_bind or vm_unbind async worker will signal the >>> returned output fence >>> + * after the completion of binding or unbinding. >>> + */ >>> +struct drm_i915_vm_bind_fence { >>> + /** @handle: User's handle for a drm_syncobj to wait on or >>> signal. */ >>> + __u32 handle; >>> + >>> + /** >>> + * @flags: Supported flags are, >>> + * >>> + * I915_VM_BIND_FENCE_WAIT: >>> + * Wait for the input fence before binding/unbinding >>> + * >>> + * I915_VM_BIND_FENCE_SIGNAL: >>> + * Return bind/unbind completion fence as output >>> + */ >>> + __u32 flags; >>> +#define I915_VM_BIND_FENCE_WAIT (1<<0) >>> +#define I915_VM_BIND_FENCE_SIGNAL (1<<1) >>> +#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS >>> (-(I915_VM_BIND_FENCE_SIGNAL << 1)) >>> +}; >>> + >>> +/** >>> + * struct drm_i915_vm_bind_ext_timeline_fences - Timeline >>> fences for vm_bind >>> + * and vm_unbind. >>> + * >>> + * This structure describes an array of timeline drm_syncobj >>> and associated >>> + * points for timeline variants of drm_syncobj. These timeline >>> 'drm_syncobj's >>> + * can be input or output fences (See struct >>> drm_i915_vm_bind_fence). >>> + */ >>> +struct drm_i915_vm_bind_ext_timeline_fences { >>> +#define I915_VM_BIND_EXT_timeline_FENCES 0 >>> + /** @base: Extension link. See struct i915_user_extension. */ >>> + struct i915_user_extension base; >>> + >>> + /** >>> + * @fence_count: Number of elements in the @handles_ptr & >>> @value_ptr >>> + * arrays. >>> + */ >>> + __u64 fence_count; >>> + >>> + /** >>> + * @handles_ptr: Pointer to an array of struct >>> drm_i915_vm_bind_fence >>> + * of length @fence_count. 
>>> + */ >>> + __u64 handles_ptr; >>> + >>> + /** >>> + * @values_ptr: Pointer to an array of u64 values of length >>> + * @fence_count. >>> + * Values must be 0 for a binary drm_syncobj. A Value of 0 >>> for a >>> + * timeline drm_syncobj is invalid as it turns a >>> drm_syncobj into a >>> + * binary one. >>> + */ >>> + __u64 values_ptr; >>> +}; >>> + >>> +/** >>> + * struct drm_i915_vm_bind_user_fence - An input or output >>> user fence for the >>> + * vm_bind or the vm_unbind work. >>> + * >>> + * The vm_bind or vm_unbind aync worker will wait for the >>> input fence (value at >>> + * @addr to become equal to @val) before starting the binding >>> or unbinding. >>> + * >>> + * The vm_bind or vm_unbind async worker will signal the >>> output fence after >>> + * the completion of binding or unbinding by writing @val to >>> memory location at >>> + * @addr >>> + */ >>> +struct drm_i915_vm_bind_user_fence { >>> + /** @addr: User/Memory fence qword aligned process virtual >>> address */ >>> + __u64 addr; >>> + >>> + /** @val: User/Memory fence value to be written after bind >>> completion */ >>> + __u64 val; >>> + >>> + /** >>> + * @flags: Supported flags are, >>> + * >>> + * I915_VM_BIND_USER_FENCE_WAIT: >>> + * Wait for the input fence before binding/unbinding >>> + * >>> + * I915_VM_BIND_USER_FENCE_SIGNAL: >>> + * Return bind/unbind completion fence as output >>> + */ >>> + __u32 flags; >>> +#define I915_VM_BIND_USER_FENCE_WAIT (1<<0) >>> +#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1) >>> +#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \ >>> + (-(I915_VM_BIND_USER_FENCE_SIGNAL << 1)) >>> +}; >>> + >>> +/** >>> + * struct drm_i915_vm_bind_ext_user_fence - User/memory fences >>> for vm_bind >>> + * and vm_unbind. >>> + * >>> + * These user fences can be input or output fences >>> + * (See struct drm_i915_vm_bind_user_fence). >>> + */ >>> +struct drm_i915_vm_bind_ext_user_fence { >>> +#define I915_VM_BIND_EXT_USER_FENCES 1 >>> + /** @base: Extension link. See struct i915_user_extension. */ >>> + struct i915_user_extension base; >>> + >>> + /** @fence_count: Number of elements in the >>> @user_fence_ptr array. */ >>> + __u64 fence_count; >>> + >>> + /** >>> + * @user_fence_ptr: Pointer to an array of >>> + * struct drm_i915_vm_bind_user_fence of length @fence_count. >>> + */ >>> + __u64 user_fence_ptr; >>> +}; >>> + >>> +/** >>> + * struct drm_i915_gem_execbuffer_ext_batch_addresses - Array >>> of batch buffer >>> + * gpu virtual addresses. >>> + * >>> + * In the execbuff ioctl (See struct >>> drm_i915_gem_execbuffer2), this extension >>> + * must always be appended in the VM_BIND mode and it will be >>> an error to >>> + * append this extension in older non-VM_BIND mode. >>> + */ >>> +struct drm_i915_gem_execbuffer_ext_batch_addresses { >>> +#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1 >>> + /** @base: Extension link. See struct i915_user_extension. */ >>> + struct i915_user_extension base; >>> + >>> + /** @count: Number of addresses in the addr array. */ >>> + __u32 count; >>> + >>> + /** @addr: An array of batch gpu virtual addresses. */ >>> + __u64 addr[0]; >>> +}; >>> + >>> +/** >>> + * struct drm_i915_gem_execbuffer_ext_user_fence - First level >>> batch completion >>> + * signaling extension. >>> + * >>> + * This extension allows user to attach a user fence (@addr, >>> @value pair) to an >>> + * execbuf to be signaled by the command streamer after the >>> completion of first >>> + * level batch, by writing the @value at specified @addr and >>> triggering an >>> + * interrupt. 
>>> + * User can either poll for this user fence to signal or can >>> also wait on it >>> + * with i915_gem_wait_user_fence ioctl. >>> + * This is very much usefaul for long running contexts where >>> waiting on dma-fence >>> + * by user (like i915_gem_wait ioctl) is not supported. >>> + */ >>> +struct drm_i915_gem_execbuffer_ext_user_fence { >>> +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2 >>> + /** @base: Extension link. See struct i915_user_extension. */ >>> + struct i915_user_extension base; >>> + >>> + /** >>> + * @addr: User/Memory fence qword aligned GPU virtual >>> address. >>> + * >>> + * Address has to be a valid GPU virtual address at the >>> time of >>> + * first level batch completion. >>> + */ >>> + __u64 addr; >>> + >>> + /** >>> + * @value: User/Memory fence Value to be written to above >>> address >>> + * after first level batch completes. >>> + */ >>> + __u64 value; >>> + >>> + /** @rsvd: Reserved for future extensions, MBZ */ >>> + __u64 rsvd; >>> +}; >>> + >>> +/** >>> + * struct drm_i915_gem_create_ext_vm_private - Extension to >>> make the object >>> + * private to the specified VM. >>> + * >>> + * See struct drm_i915_gem_create_ext. >>> + */ >>> +struct drm_i915_gem_create_ext_vm_private { >>> +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2 >>> + /** @base: Extension link. See struct i915_user_extension. */ >>> + struct i915_user_extension base; >>> + >>> + /** @vm_id: Id of the VM to which the object is private */ >>> + __u32 vm_id; >>> +}; >>> + >>> +/** >>> + * struct drm_i915_gem_wait_user_fence - Wait on user/memory >>> fence. >>> + * >>> + * User/Memory fence can be woken up either by: >>> + * >>> + * 1. GPU context indicated by @ctx_id, or, >>> + * 2. Kerrnel driver async worker upon I915_UFENCE_WAIT_SOFT. >>> + * @ctx_id is ignored when this flag is set. >>> + * >>> + * Wakeup condition is, >>> + * ``((*addr & mask) op (value & mask))`` >>> + * >>> + * See :ref:`Documentation/driver-api/dma-buf.rst >>> <indefinite_dma_fences>` >>> + */ >>> +struct drm_i915_gem_wait_user_fence { >>> + /** @extensions: Zero-terminated chain of extensions. */ >>> + __u64 extensions; >>> + >>> + /** @addr: User/Memory fence address */ >>> + __u64 addr; >>> + >>> + /** @ctx_id: Id of the Context which will signal the >>> fence. */ >>> + __u32 ctx_id; >>> + >>> + /** @op: Wakeup condition operator */ >>> + __u16 op; >>> +#define I915_UFENCE_WAIT_EQ 0 >>> +#define I915_UFENCE_WAIT_NEQ 1 >>> +#define I915_UFENCE_WAIT_GT 2 >>> +#define I915_UFENCE_WAIT_GTE 3 >>> +#define I915_UFENCE_WAIT_LT 4 >>> +#define I915_UFENCE_WAIT_LTE 5 >>> +#define I915_UFENCE_WAIT_BEFORE 6 >>> +#define I915_UFENCE_WAIT_AFTER 7 >>> + >>> + /** >>> + * @flags: Supported flags are, >>> + * >>> + * I915_UFENCE_WAIT_SOFT: >>> + * >>> + * To be woken up by i915 driver async worker (not by GPU). >>> + * >>> + * I915_UFENCE_WAIT_ABSTIME: >>> + * >>> + * Wait timeout specified as absolute time. >>> + */ >>> + __u16 flags; >>> +#define I915_UFENCE_WAIT_SOFT 0x1 >>> +#define I915_UFENCE_WAIT_ABSTIME 0x2 >>> + >>> + /** @value: Wakeup value */ >>> + __u64 value; >>> + >>> + /** @mask: Wakeup mask */ >>> + __u64 mask; >>> +#define I915_UFENCE_WAIT_U8 0xffu >>> +#define I915_UFENCE_WAIT_U16 0xffffu >>> +#define I915_UFENCE_WAIT_U32 0xfffffffful >>> +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull >>> + >>> + /** >>> + * @timeout: Wait timeout in nanoseconds. >>> + * >>> + * If I915_UFENCE_WAIT_ABSTIME flag is set, then time >>> timeout is the >>> + * absolute time in nsec. >>> + */ >>> + __s64 timeout; >>> +};
On Tue, May 17, 2022 at 11:32:12AM -0700, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
v2: Ensure proper kernel-doc formatting with cross references. Also add new uapi and documentation as per review comments from Daniel.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 399 +++++++++++++++++++++++++++
1 file changed, 399 insertions(+)
create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
new file mode 100644
index 000000000000..589c0a009107
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.h
@@ -0,0 +1,399 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+/**
- DOC: I915_PARAM_HAS_VM_BIND
- VM_BIND feature availability.
- See typedef drm_i915_getparam_t param.
- */
+#define I915_PARAM_HAS_VM_BIND 57
+/**
- DOC: I915_VM_CREATE_FLAGS_USE_VM_BIND
- Flag to opt-in for VM_BIND mode of binding during VM creation.
- See struct drm_i915_gem_vm_control flags.
- A VM in VM_BIND mode will not support the older execbuff mode of binding.
- In VM_BIND mode, execbuff ioctl will not accept any execlist (ie., the
- &drm_i915_gem_execbuffer2.buffer_count must be 0).
- Also, &drm_i915_gem_execbuffer2.batch_start_offset and
- &drm_i915_gem_execbuffer2.batch_len must be 0.
- DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES extension must be provided
- to pass in the batch buffer addresses.
- Additionally, I915_EXEC_NO_RELOC, I915_EXEC_HANDLE_LUT and
- I915_EXEC_BATCH_FIRST of &drm_i915_gem_execbuffer2.flags must be 0
- (not used) in VM_BIND mode. I915_EXEC_USE_EXTENSIONS flag must always be
- set (See struct drm_i915_gem_execbuffer_ext_batch_addresses).
- The buffers_ptr, buffer_count, batch_start_offset and batch_len fields
- of struct drm_i915_gem_execbuffer2 are also not used and must be 0.
- */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0)
+/**
- DOC: I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING
- Flag to declare context as long running.
- See struct drm_i915_gem_context_create_ext flags.
- Usage of dma-fence expects that they complete in reasonable amount of time.
- Compute on the other hand can be long running. Hence it is not appropriate
- for compute contexts to export request completion dma-fence to user.
- The dma-fence usage will be limited to in-kernel consumption only.
- Compute contexts need to use user/memory fence.
- So, long running contexts do not support output fences. Hence,
- I915_EXEC_FENCE_OUT (See &drm_i915_gem_execbuffer2.flags and
- I915_EXEC_FENCE_SIGNAL (See &drm_i915_gem_exec_fence.flags) are expected
- to be not used.
- DRM_I915_GEM_WAIT ioctl call is also not supported for objects mapped
- to long running contexts.
- */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
+/* VM_BIND related ioctls */
+#define DRM_I915_GEM_VM_BIND		0x3d
+#define DRM_I915_GEM_VM_UNBIND		0x3e
+#define DRM_I915_GEM_WAIT_USER_FENCE	0x3f
+#define DRM_IOCTL_I915_GEM_VM_BIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_VM_UNBIND		DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE	DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+/**
- struct drm_i915_gem_vm_bind - VA to object mapping to bind.
- This structure is passed to VM_BIND ioctl and specifies the mapping of GPU
- virtual address (VA) range to the section of an object that should be bound
- in the device page table of the specified address space (VM).
- The VA range specified must be unique (ie., not currently bound) and can
- be mapped to whole object or a section of the object (partial binding).
- Multiple VA mappings can be created to the same section of the object
- (aliasing).
- */
+struct drm_i915_gem_vm_bind {
- /** @vm_id: VM (address space) id to bind */
- __u32 vm_id;
- /** @handle: Object handle */
- __u32 handle;
- /** @start: Virtual Address start to bind */
- __u64 start;
- /** @offset: Offset in object to bind */
- __u64 offset;
- /** @length: Length of mapping to bind */
- __u64 length;
- /**
* @flags: Supported flags are,
*
* I915_GEM_VM_BIND_READONLY:
* Mapping is read-only.
*
* I915_GEM_VM_BIND_CAPTURE:
* Capture this mapping in the dump upon GPU error.
*/
- __u64 flags;
+#define I915_GEM_VM_BIND_READONLY (1 << 0)
+#define I915_GEM_VM_BIND_CAPTURE (1 << 1)
- /** @extensions: 0-terminated chain of extensions for this mapping. */
- __u64 extensions;
+};
+/**
- struct drm_i915_gem_vm_unbind - VA to object mapping to unbind.
- This structure is passed to VM_UNBIND ioctl and specifies the GPU virtual
- address (VA) range that should be unbound from the device page table of the
- specified address space (VM). The specified VA range must match one of the
- mappings created with the VM_BIND ioctl. TLB is flushed upon unbind
- completion.
- */
+struct drm_i915_gem_vm_unbind {
- /** @vm_id: VM (address space) id to bind */
- __u32 vm_id;
- /** @rsvd: Reserved for future use; must be zero. */
- __u32 rsvd;
- /** @start: Virtual Address start to unbind */
- __u64 start;
- /** @length: Length of mapping to unbind */
- __u64 length;
This probably isn't needed. We are never going to unbind a subset of a VMA, are we? That being said, it can't hurt as a sanity check (e.g. internal vma->length == user unbind length).
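For example, something along these lines in the unbind path (a sketch only; the lookup helper and field names are hypothetical, not actual i915 internals):

	/* Reject an unbind whose range does not exactly match an existing mapping. */
	vma = lookup_persistent_vma(vm, args->start);	/* hypothetical lookup */
	if (!vma || vma->length != args->length)
		return -ENOENT;	/* not an exact, existing mapping */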
- /** @flags: reserved for future usage, currently MBZ */
- __u64 flags;
- /** @extensions: 0-terminated chain of extensions for this mapping. */
- __u64 extensions;
+};
+/**
- struct drm_i915_vm_bind_fence - An input or output fence for the vm_bind
- or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence to signal
- before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the returned output fence
- after the completion of binding or unbinding.
- */
+struct drm_i915_vm_bind_fence {
- /** @handle: User's handle for a drm_syncobj to wait on or signal. */
- __u32 handle;
- /**
* @flags: Supported flags are,
*
* I915_VM_BIND_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
- __u32 flags;
+#define I915_VM_BIND_FENCE_WAIT (1<<0)
+#define I915_VM_BIND_FENCE_SIGNAL (1<<1)
+#define __I915_VM_BIND_FENCE_UNKNOWN_FLAGS (-(I915_VM_BIND_FENCE_SIGNAL << 1))
+};
+/**
- struct drm_i915_vm_bind_ext_timeline_fences - Timeline fences for vm_bind
- and vm_unbind.
- This structure describes an array of timeline drm_syncobj and associated
- points for timeline variants of drm_syncobj. These timeline 'drm_syncobj's
- can be input or output fences (See struct drm_i915_vm_bind_fence).
- */
+struct drm_i915_vm_bind_ext_timeline_fences {
+#define I915_VM_BIND_EXT_timeline_FENCES 0
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /**
 * @fence_count: Number of elements in the @handles_ptr & @values_ptr
* arrays.
*/
- __u64 fence_count;
- /**
* @handles_ptr: Pointer to an array of struct drm_i915_vm_bind_fence
* of length @fence_count.
*/
- __u64 handles_ptr;
- /**
* @values_ptr: Pointer to an array of u64 values of length
* @fence_count.
* Values must be 0 for a binary drm_syncobj. A Value of 0 for a
* timeline drm_syncobj is invalid as it turns a drm_syncobj into a
* binary one.
*/
- __u64 values_ptr;
+};
+/**
- struct drm_i915_vm_bind_user_fence - An input or output user fence for the
- vm_bind or the vm_unbind work.
- The vm_bind or vm_unbind async worker will wait for the input fence (value at
- @addr to become equal to @val) before starting the binding or unbinding.
- The vm_bind or vm_unbind async worker will signal the output fence after
- the completion of binding or unbinding by writing @val to memory location at
- @addr
- */
+struct drm_i915_vm_bind_user_fence {
- /** @addr: User/Memory fence qword aligned process virtual address */
- __u64 addr;
- /** @val: User/Memory fence value to be written after bind completion */
- __u64 val;
- /**
* @flags: Supported flags are,
*
* I915_VM_BIND_USER_FENCE_WAIT:
* Wait for the input fence before binding/unbinding
*
* I915_VM_BIND_USER_FENCE_SIGNAL:
* Return bind/unbind completion fence as output
*/
- __u32 flags;
+#define I915_VM_BIND_USER_FENCE_WAIT (1<<0)
+#define I915_VM_BIND_USER_FENCE_SIGNAL (1<<1)
+#define __I915_VM_BIND_USER_FENCE_UNKNOWN_FLAGS \
+	(-(I915_VM_BIND_USER_FENCE_SIGNAL << 1))
+};
+/**
- struct drm_i915_vm_bind_ext_user_fence - User/memory fences for vm_bind
- and vm_unbind.
- These user fences can be input or output fences
- (See struct drm_i915_vm_bind_user_fence).
- */
+struct drm_i915_vm_bind_ext_user_fence {
+#define I915_VM_BIND_EXT_USER_FENCES 1
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @fence_count: Number of elements in the @user_fence_ptr array. */
- __u64 fence_count;
- /**
* @user_fence_ptr: Pointer to an array of
* struct drm_i915_vm_bind_user_fence of length @fence_count.
*/
- __u64 user_fence_ptr;
+};
IMO all of these fence structs should be a generic sync interface shared between both vm bind and exec3 rather than unique extensions.
Both vm bind and exec3 should have something like this:
__u64 syncs; /* userptr to an array of generic syncs */ __u64 n_syncs;
Having an array of syncs lets the kernel do one user copy for all the syncs rather than reading them in a chain.
A generic sync object encapsulates all possible syncs (in / out - syncobj, syncobj timeline, ufence, future sync concepts).
e.g.
struct {
	__u32 user_ext;
	__u32 flag;	/* in / out, type, whatever else info we need */
	union {
		__u32 handle;	/* to syncobj */
		__u64 addr;	/* ufence address */
	};
	__u64 seqno;	/* syncobj timeline, ufence write value */
	...reserve enough bits for future...
}
This unifies binds and execs by using the same sync interface, instilling the concept that binds and execs are the same op (a queued operation with in/out fences).
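Purely as an illustration of how that could look from userspace (every name below is made up for this sketch; nothing here is proposed uapi):

	/* Hypothetical flag bits for the generic sync struct sketched above. */
	#define EXAMPLE_SYNC_IN		(1u << 0)
	#define EXAMPLE_SYNC_OUT	(1u << 1)
	#define EXAMPLE_SYNC_SYNCOBJ	(1u << 2)
	#define EXAMPLE_SYNC_UFENCE	(1u << 3)

	struct example_sync {		/* mirrors the sketch above */
		__u32 user_ext;
		__u32 flag;
		union {
			__u32 handle;	/* drm_syncobj handle */
			__u64 addr;	/* ufence address */
		};
		__u64 seqno;		/* timeline point / ufence write value */
	};

	/* wait_syncobj and ufence_addr are placeholders obtained elsewhere.
	 * One wait syncobj in, one ufence signalled out, in a single array:
	 */
	struct example_sync syncs[2] = {
		{ .flag = EXAMPLE_SYNC_IN  | EXAMPLE_SYNC_SYNCOBJ, .handle = wait_syncobj },
		{ .flag = EXAMPLE_SYNC_OUT | EXAMPLE_SYNC_UFENCE,  .addr = ufence_addr, .seqno = 1 },
	};

	vm_bind.syncs   = (__u64)(uintptr_t)syncs;	/* one copy_from_user for everything */
	vm_bind.n_syncs = 2;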
Matt
+/**
- struct drm_i915_gem_execbuffer_ext_batch_addresses - Array of batch buffer
- gpu virtual addresses.
- In the execbuff ioctl (See struct drm_i915_gem_execbuffer2), this extension
- must always be appended in the VM_BIND mode and it will be an error to
- append this extension in older non-VM_BIND mode.
- */
+struct drm_i915_gem_execbuffer_ext_batch_addresses {
+#define DRM_I915_GEM_EXECBUFFER_EXT_BATCH_ADDRESSES 1
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @count: Number of addresses in the addr array. */
- __u32 count;
- /** @addr: An array of batch gpu virtual addresses. */
- __u64 addr[0];
+};
+/**
- struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
- signaling extension.
- This extension allows user to attach a user fence (@addr, @value pair) to an
- execbuf to be signaled by the command streamer after the completion of first
- level batch, by writing the @value at specified @addr and triggering an
- interrupt.
- User can either poll for this user fence to signal or can also wait on it
- with i915_gem_wait_user_fence ioctl.
- This is very useful for long running contexts where waiting on a dma-fence
- by user (like i915_gem_wait ioctl) is not supported.
- */
+struct drm_i915_gem_execbuffer_ext_user_fence {
+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 2
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /**
* @addr: User/Memory fence qword aligned GPU virtual address.
*
* Address has to be a valid GPU virtual address at the time of
* first level batch completion.
*/
- __u64 addr;
- /**
* @value: User/Memory fence Value to be written to above address
* after first level batch completes.
*/
- __u64 value;
- /** @rsvd: Reserved for future extensions, MBZ */
- __u64 rsvd;
+};
+/**
- struct drm_i915_gem_create_ext_vm_private - Extension to make the object
- private to the specified VM.
- See struct drm_i915_gem_create_ext.
- */
+struct drm_i915_gem_create_ext_vm_private {
+#define I915_GEM_CREATE_EXT_VM_PRIVATE 2
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** @vm_id: Id of the VM to which the object is private */
- __u32 vm_id;
+};
+/**
- struct drm_i915_gem_wait_user_fence - Wait on user/memory fence.
- User/Memory fence can be woken up either by:
- GPU context indicated by @ctx_id, or,
- Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
- @ctx_id is ignored when this flag is set.
- Wakeup condition is,
- ``((*addr & mask) op (value & mask))``
- See :ref:`Documentation/driver-api/dma-buf.rst <indefinite_dma_fences>`
- */
+struct drm_i915_gem_wait_user_fence {
- /** @extensions: Zero-terminated chain of extensions. */
- __u64 extensions;
- /** @addr: User/Memory fence address */
- __u64 addr;
- /** @ctx_id: Id of the Context which will signal the fence. */
- __u32 ctx_id;
- /** @op: Wakeup condition operator */
- __u16 op;
+#define I915_UFENCE_WAIT_EQ 0
+#define I915_UFENCE_WAIT_NEQ 1
+#define I915_UFENCE_WAIT_GT 2
+#define I915_UFENCE_WAIT_GTE 3
+#define I915_UFENCE_WAIT_LT 4
+#define I915_UFENCE_WAIT_LTE 5
+#define I915_UFENCE_WAIT_BEFORE 6
+#define I915_UFENCE_WAIT_AFTER 7
- /**
* @flags: Supported flags are,
*
* I915_UFENCE_WAIT_SOFT:
*
* To be woken up by i915 driver async worker (not by GPU).
*
* I915_UFENCE_WAIT_ABSTIME:
*
* Wait timeout specified as absolute time.
*/
- __u16 flags;
+#define I915_UFENCE_WAIT_SOFT 0x1
+#define I915_UFENCE_WAIT_ABSTIME 0x2
- /** @value: Wakeup value */
- __u64 value;
- /** @mask: Wakeup mask */
- __u64 mask;
+#define I915_UFENCE_WAIT_U8 0xffu
+#define I915_UFENCE_WAIT_U16 0xffffu
+#define I915_UFENCE_WAIT_U32 0xfffffffful
+#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull
- /**
* @timeout: Wait timeout in nanoseconds.
*
 * If I915_UFENCE_WAIT_ABSTIME flag is set, then the timeout is the
* absolute time in nsec.
*/
- __s64 timeout;
+};
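As a rough note on the wakeup condition documented above, the semantics amount to something like the following userspace-visible check (illustrative only; a real waiter would block in the ioctl rather than spin, and addr/value/mask are placeholders):

	/* Woken once ((*addr & mask) op (value & mask)) holds; e.g. with
	 * op = I915_UFENCE_WAIT_GTE and mask = I915_UFENCE_WAIT_U64:
	 */
	while (!((*(volatile __u64 *)addr & mask) >= (value & mask)))
		;	/* or block in DRM_IOCTL_I915_GEM_WAIT_USER_FENCE */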
2.21.0.rc0.32.g243a4c7e27