[Intel-gfx] [RFC v2 1/2] drm/doc/rfc: VM_BIND feature design document

Wed Apr 20 22:50:00 UTC 2022

On Thu, Mar 31, 2022 at 01:37:08PM +0200, Daniel Vetter wrote:
>One thing I've forgotten, since it's only hinted at here: If/when we
>switch tlb flushing from the current dumb&synchronous implementation
>we now have in i915 in upstream to one with batching using dma_fence,
>then I think that should be something which is done with a small
>helper library of shared code too. The batching is somewhat tricky,
>and you need to make sure you put the fence into the right
>dma_resv_usage slot, and the trick with replace the vm fence with a
>tlb flush fence is also a good reason to share the code so we only
>have it one.
>
>Christian's recent work also has some prep work for this already with
>the fence replacing trick.

Sure, but this optimization is not required for initial vm_bind support
to land right? We can look at it soon after that. Is that ok?
I have made a reference to this TLB flush batching work in the rst file.

Niranjana

>-Daniel
>
>On Thu, 31 Mar 2022 at 10:28, Daniel Vetter <daniel at ffwll.ch> wrote:
>> Adding a pile of people who've expressed interest in vm_bind for their
>> drivers.
>>
>> Also note to the intel folks: This is largely written with me having my
>> subsystem co-maintainer hat on, i.e. what I think is the right thing to do
>> here for the subsystem at large. There is substantial rework involved
>> here, but it's not any different from i915 adopting ttm or i915 adpoting
>> drm/sched, and I do think this stuff needs to happen in one form or
>> another.
>>
>> On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
>> > VM_BIND design document with description of intended use cases.
>> >
>> > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura at intel.com>
>> > ---
>> >  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
>> >  Documentation/gpu/rfc/index.rst        |   4 +
>> >  2 files changed, 214 insertions(+)
>> >  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
>> >
>> > diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
>> > new file mode 100644
>> > index 000000000000..cdc6bb25b942
>> > --- /dev/null
>> > +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
>> > @@ -0,0 +1,210 @@
>> > +==========================================
>> > +I915 VM_BIND feature design and use cases
>> > +==========================================
>> > +
>> > +VM_BIND feature
>> > +================
>> > +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
>> > +objects (BOs) or sections of a BOs at specified GPU virtual addresses on
>> > +a specified address space (VM).
>> > +
>> > +These mappings (also referred to as persistent mappings) will be persistent
>> > +across multiple GPU submissions (execbuff) issued by the UMD, without user
>> > +having to provide a list of all required mappings during each submission
>> > +(as required by older execbuff mode).
>> > +
>> > +VM_BIND ioctl deferes binding the mappings until next execbuff submission
>> > +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE
>> > +flag is set (useful if mapping is required for an active context).
>>
>> So this is a screw-up I've done, and for upstream I think we need to fix
>> it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
>> I was wrong suggesting we should do this a few years back when we kicked
>> this off internally :-(
>>
>> What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
>> things on top:
>> - in and out fences, like with execbuf, to allow userspace to sync with
>>   execbuf as needed
>> - for compute-mode context this means userspace memory fences
>> - for legacy context this means a timeline syncobj in drm_syncobj
>>
>> No sync_file or anything else like this at all. This means a bunch of
>> work, but also it'll have benefits because it means we should be able to
>> use exactly the same code paths and logic for both compute and for legacy
>> context, because drm_syncobj support future fence semantics.
>>
>> Also on the implementation side we still need to install dma_fence to the
>> various dma_resv, and for this we need the new dma_resv_usage series from
>> Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
>> flag to make sure they never result in an oversync issue with execbuf. I
>> don't think trying to land vm_bind without that prep work in
>> dma_resv_usage makes sense.
>>
>> Also as soon as dma_resv_usage has landed there's a few cleanups we should
>> do in i915:
>> - ttm bo moving code should probably simplify a bit (and maybe more of the
>>   code should be pushed as helpers into ttm)
>> - clflush code should be moved over to using USAGE_KERNEL and the various
>>   hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
>>   expand on the kernel-doc for cache_dirty") for a bit more context
>>
>> This is still not yet enough, since if a vm_bind races with an eviction we
>> might stall on the new buffers being readied first before the context can
>> continue. This needs some care to make sure that vma which aren't fully
>> bound yet are on a separate list, and vma which are marked for unbinding
>> are removed from the main working set list as soon as possible.
>>
>> All of these things are relevant for the uapi semantics, which means
>> - they need to be documented in the uapi kerneldoc, ideally with example
>>   flows
>> - umd need to ack this
>>
>> The other thing here is the async/nonblocking path. I think we still need
>> that one, but again it should not sync with anything going on in execbuf,
>> but simply execute the ioctl code in a kernel thread. The idea here is
>> that this works like a special gpu engine, so that compute and vk can
>> schedule bindings interleaved with rendering. This should be enough to get
>> a performant vk sparse binding/textures implementation.
>>
>> But I'm not entirely sure on this one, so this definitely needs acks from
>> umds.
>>
>> > +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
>> > +User has to opt-in for VM_BIND mode of binding for an address space (VM)
>> > +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
>> > +A VM in VM_BIND mode will not support older execbuff mode of binding.
>> > +
>> > +UMDs can still send BOs of these persistent mappings in execlist of execbuff
>> > +for specifying BO dependencies (implicit fencing) and to use BO as a batch,
>> > +but those BOs should be mapped ahead via vm_bind ioctl.
>>
>> should or must?
>>
>> Also I'm not really sure that's a great interface. The batchbuffer really
>> only needs to be an address, so maybe all we need is an extension to
>> supply an u64 batchbuffer address instead of trying to retrofit this into
>> an unfitting current uapi.
>>
>> And for implicit sync there's two things:
>> - for vk I think the right uapi is the dma-buf fence import/export ioctls
>>   from Jason Ekstrand. I think we should land that first instead of
>>   hacking funny concepts together
>> - for gl the dma-buf import/export might not be fast enough, since gl
>>   needs to do a _lot_ of implicit sync. There we might need to use the
>>   execbuffer buffer list, but then we should have extremely clear uapi
>>   rules which disallow _everything_ except setting the explicit sync uapi
>>
>> Again all this stuff needs to be documented in detail in the kerneldoc
>> uapi spec.
>>
>> > +VM_BIND features include,
>> > +- Multiple Virtual Address (VA) mappings can map to the same physical pages
>> > +  of an object (aliasing).
>> > +- VA mapping can map to a partial section of the BO (partial binding).
>> > +- Support capture of persistent mappings in the dump upon GPU error.
>> > +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
>> > +  usecases will be helpful.
>> > +- Asynchronous vm_bind and vm_unbind support.
>> > +- VM_BIND uses user/memory fence mechanism for signaling bind completion
>> > +  and for signaling batch completion in long running contexts (explained
>> > +  below).
>>
>> This should all be in the kerneldoc.
>>
>> > +VM_PRIVATE objects
>> > +------------------
>> > +By default, BOs can be mapped on multiple VMs and can also be dma-buf
>> > +exported. Hence these BOs are referred to as Shared BOs.
>> > +During each execbuff submission, the request fence must be added to the
>> > +dma-resv fence list of all shared BOs mapped on the VM.
>> > +
>> > +VM_BIND feature introduces an optimization where user can create BO which
>> > +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
>> > +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
>> > +the VM they are private to and can't be dma-buf exported.
>> > +All private BOs of a VM share the dma-resv object. Hence during each execbuff
>> > +submission, they need only one dma-resv fence list updated. Thus the fast
>> > +path (where required mappings are already bound) submission latency is O(1)
>> > +w.r.t the number of VM private BOs.
>>
>> Two things:
>>
>> - I think the above is required to for initial vm_bind for vk, it kinda
>>   doesn't make much sense without that, and will allow us to match amdgpu
>>   and radeonsi
>>
>> - Christian König just landed ttm bulk lru helpers, and I think we need to
>>   use those. This means vm_bind will only work with the ttm backend, but
>>   that's what we have for the big dgpu where vm_bind helps more in terms
>>   of performance, and the igfx conversion to ttm is already going on.
>>
>> Furthermore the i915 shrinker lru has stopped being an lru, so I think
>> that should also be moved over to the ttm lru in some fashion to make sure
>> we once again have a reasonable and consistent memory aging and reclaim
>> architecture. The current code is just too much of a complete mess.
>>
>> And since this is all fairly integral to how the code arch works I don't
>> think merging a different version which isn't based on ttm bulk lru
>> helpers makes sense.
>>
>> Also I do think the page table lru handling needs to be included here,
>> because that's another complete hand-rolled separate world for not much
>> good reasons. I guess that can happen in parallel with the initial vm_bind
>> bring-up, but it needs to be completed by the time we add the features
>> beyond the initial support needed for vk.
>>
>> > +VM_BIND locking hirarchy
>> > +-------------------------
>> > +VM_BIND locking order is as below.
>> > +
>> > +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
>> > +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
>> > +
>> > +   In future, when GPU page faults are supported, we can potentially use a
>> > +   rwsem instead, so that multiple pagefault handlers can take the read side
>> > +   lock to lookup the mapping and hence can run in parallel.
>> > +
>> > +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
>> > +   while binding a vma and while updating dma-resv fence list of a BO.
>> > +   The private BOs of a VM will all share a dma-resv object.
>> > +
>> > +   This lock is held in vm_bind call for immediate binding, during vm_unbind
>> > +   call for unbinding and during execbuff path for binding the mapping and
>> > +   updating the dma-resv fence list of the BO.
>> > +
>> > +3) Spinlock/s to protect some of the VM's lists.
>> > +
>> > +We will also need support for bluk LRU movement of persistent mapping to
>> > +avoid additional latencies in execbuff path.
>>
>> This needs more detail and explanation of how each level is required. Also
>> the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
>>
>> Like "some of the VM's lists" explains pretty much nothing.
>>
>> > +
>> > +GPU page faults
>> > +----------------
>> > +Both older execbuff mode and the newer VM_BIND mode of binding will require
>> > +using dma-fence to ensure residency.
>> > +In future when GPU page faults are supported, no dma-fence usage is required
>> > +as residency is purely managed by installing and removing/invalidating ptes.
>>
>> This is a bit confusing. I think one part of this should be moved into the
>> section with future vm_bind use-cases (we're not going to support page
>> faults with legacy softpin or even worse, relocations). The locking
>> discussion should be part of the much longer list of uses cases that
>> motivate the locking design.
>>
>> > +
>> > +
>> > +User/Memory Fence
>> > +==================
>> > +The idea is to take a user specified virtual address and install an interrupt
>> > +handler to wake up the current task when the memory location passes the user
>> > +supplied filter.
>> > +
>> > +User/Memory fence is a <address, value> pair. To signal the user fence,
>> > +specified value will be written at the specified virtual address and
>> > +wakeup the waiting process. User can wait on an user fence with the
>> > +gem_wait_user_fence ioctl.
>> > +
>> > +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
>> > +interrupt within their batches after updating the value to have sub-batch
>> > +precision on the wakeup. Each batch can signal an user fence to indicate
>> > +the completion of next level batch. The completion of very first level batch
>> > +needs to be signaled by the command streamer. The user must provide the
>> > +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
>> > +extension of execbuff ioctl, so that KMD can setup the command streamer to
>> > +signal it.
>> > +
>> > +User/Memory fence can also be supplied to the kernel driver to signal/wake up
>> > +the user process after completion of an asynchronous operation.
>> > +
>> > +When VM_BIND ioctl was provided with a user/memory fence via the
>> > +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
>> > +of binding of that mapping. All async binds/unbinds are serialized, hence
>> > +signaling of user/memory fence also indicate the completion of all previous
>> > +binds/unbinds.
>> > +
>> > +This feature will be derived from the below original work:
>> > +https://patchwork.freedesktop.org/patch/349417/
>>
>> This is 1:1 tied to long running compute mode contexts (which in the uapi
>> doc must reference the endless amounts of bikeshed summary we have in the
>> docs about indefinite fences).
>>
>> I'd put this into a new section about compute and userspace memory fences
>> support, with this and the next chapter ...
>> > +
>> > +
>> > +VM_BIND use cases
>> > +==================
>>
>> ... and then make this section here focus entirely on additional vm_bind
>> use-cases that we'll be adding later on. Which doesn't need to go into any
>> details, it's just justification for why we want to build the world on top
>> of vm_bind.
>>
>> > +
>> > +Long running Compute contexts
>> > +------------------------------
>> > +Usage of dma-fence expects that they complete in reasonable amount of time.
>> > +Compute on the other hand can be long running. Hence it is appropriate for
>> > +compute to use user/memory fence and dma-fence usage will be limited to
>> > +in-kernel consumption only. This requires an execbuff uapi extension to pass
>> > +in user fence. Compute must opt-in for this mechanism with
>> > +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
>> > +
>> > +The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence
>> > +and implicit dependency setting is not allowed on long running contexts.
>> > +
>> > +Where GPU page faults are not available, kernel driver upon buffer invalidation
>> > +will initiate a suspend (preemption) of long running context with a dma-fence
>> > +attached to it. And upon completion of that suspend fence, finish the
>> > +invalidation, revalidate the BO and then resume the compute context. This is
>> > +done by having a per-context fence (called suspend fence) proxying as
>> > +i915_request fence. This suspend fence is enabled when there is a wait on it,
>> > +which triggers the context preemption.
>> > +
>> > +This is much easier to support with VM_BIND compared to the current heavier
>> > +execbuff path resource attachment.
>>
>> There's a bunch of tricky code around compute mode context support, like
>> the preempt ctx fence (or suspend fence or whatever you want to call it),
>> and the resume work. And I think that code should be shared across
>> drivers.
>>
>> I think the right place to put this is into drm/sched, somewhere attached
>> to the drm_sched_entity structure. I expect i915 folks to collaborate with
>> amd and ideally also get amdkfd to adopt the same thing if possible. At
>> least Christian has mentioned in the past that he's a bit unhappy about
>> how this works.
>>
>> Also drm/sched has dependency tracking, which will be needed to pipeline
>> context resume operations. That needs to be used instead of i915-gem
>> inventing yet another dependency tracking data structure (it already has 3
>> and that's roughly 3 too many).
>>
>> This means compute mode support and userspace memory fences are blocked on
>> the drm/sched conversion, but *eh* add it to the list of reasons for why
>> drm/sched needs to happen.
>>
>> Also since we only have support for compute mode ctx in our internal tree
>> with the guc scheduler backend anyway, and the first conversion target is
>> the guc backend, I don't think this actually holds up a lot of the code.
>>
>> > +Low Latency Submission
>> > +-----------------------
>> > +Allows compute UMD to directly submit GPU jobs instead of through execbuff
>> > +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
>>
>> This is really just a special case of compute mode contexts, I think I'd
>> include that in there, but explain better what it requires (i.e. vm_bind
>> not being synchronized against execbuf).
>>
>> > +
>> > +Debugger
>> > +---------
>> > +With debug event interface user space process (debugger) is able to keep track
>> > +of and act upon resources created by another process (debuggee) and attached
>> > +to GPU via vm_bind interface.
>> > +
>> > +Mesa/Valkun
>> > +------------
>> > +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving
>> > +performance. For Vulkan it should be straightforward to use VM_BIND.
>> > +For Iris implicit buffer tracking must be implemented before we can harness
>> > +VM_BIND benefits. With increasing GPU hardware performance reducing CPU
>> > +overhead becomes more important.
>>
>> Just to clarify, I don't think we can land vm_bind into upstream if it
>> doesn't work 100% for vk. There's a bit much "can" instead of "will in
>> this section".
>>
>> > +
>> > +Page level hints settings
>> > +--------------------------
>> > +VM_BIND allows any hints setting per mapping instead of per BO.
>> > +Possible hints include read-only, placement and atomicity.
>> > +Sub-BO level placement hint will be even more relevant with
>> > +upcoming GPU on-demand page fault support.
>> > +
>> > +Page level Cache/CLOS settings
>> > +-------------------------------
>> > +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
>> > +
>> > +Shared Virtual Memory (SVM) support
>> > +------------------------------------
>> > +VM_BIND interface can be used to map system memory directly (without gem BO
>> > +abstraction) using the HMM interface.
>>
>> Userptr is absent here (and it's not the same as svm, at least on
>> discrete), and this is needed for the initial version since otherwise vk
>> can't use it because we're not at feature parity.
>>
>> Irc discussions by Maarten and Dave came up with the idea that maybe
>> userptr for vm_bind should work _without_ any gem bo as backing storage,
>> since that guarantees that people don't come up with funny ideas like
>> trying to share such bo across process or mmap it and other nonsense which
>> just doesn't work.
>>
>> > +
>> > +
>> > +Broder i915 cleanups
>> > +=====================
>> > +Supporting this whole new vm_bind mode of binding which comes with its own
>> > +usecases to support and the locking requirements requires proper integration
>> > +with the existing i915 driver. This calls for some broader i915 driver
>> > +cleanups/simplifications for maintainability of the driver going forward.
>> > +Here are few things identified and are being looked into.
>> > +
>> > +- Make pagetable allocations evictable and manage them similar to VM_BIND
>> > +  mapped objects. Page table pages are similar to persistent mappings of a
>> > +  VM (difference here are that the page table pages will not
>> > +  have an i915_vma structure and after swapping pages back in, parent page
>> > +  link needs to be updated).
>>
>> See above, but I think this should be included as part of the initial
>> vm_bind push.
>>
>> > +- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
>> > +  do not use it and complexity it brings in is probably more than the
>> > +  performance advantage we get in legacy execbuff case.
>> > +- Remove vma->open_count counting
>> > +- Remove i915_vma active reference tracking. Instead use underlying BO's
>> > +  dma-resv fence list to determine if a i915_vma is active or not.
>>
>> So this is a complete mess, and really should not exist. I think it needs
>> to be removed before we try to make i915_vma even more complex by adding
>> vm_bind.
>>
>> The other thing I've been pondering here is that vm_bind is really
>> completely different from legacy vm structures for a lot of reasons:
>> - no relocation or softpin handling, which means vm_bind has no reason to
>>   ever look at the i915_vma structure in execbuf code. Unfortunately
>>   execbuf has been rewritten to be vma instead of obj centric, so it's a
>>   100% mismatch
>>
>> - vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
>>   that because the kernel manages the virtual address space fully. Again
>>   ideally that entire vma_move_to_active code and everything related to it
>>   would simply not exist.
>>
>> - similar on the eviction side, the rules are quite different: For vm_bind
>>   we never tear down the vma, instead it's just moved to the list of
>>   evicted vma. Legacy vm have no need for all these additional lists, so
>>   another huge confusion.
>>
>> - if the refcount is done correctly for vm_bind we wouldn't need the
>>   tricky code in the bo close paths. Unfortunately legacy vm with
>>   relocations and softpin require that vma are only a weak reference, so
>>   that cannot be removed.
>>
>> - there's also a ton of special cases for ggtt handling, like the
>>   different views (for display, partial views for mmap), but also the
>>   gen2/3 alignment and padding requirements which vm_bind never needs.
>>
>> I think the right thing here is to massively split the implementation
>> behind some solid vm/vma abstraction, with a base clase for vm and vma
>> which _only_ has the pieces which both vm_bind and the legacy vm stuff
>> needs. But it's a bit tricky to get there. I think a workable path would
>> be:
>> - Add a new base class to both i915_address_space and i915_vma, which
>>   starts out empty.
>>
>> - As vm_bind code lands, move things that vm_bind code needs into these
>>   base classes
>>
>> - The goal should be that these base classes are a stand-alone library
>>   that other drivers could reuse. Like we've done with the buddy
>>   allocator, which first moved from i915-gem to i915-ttm, and which amd
>>   now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
>>   interested in adding something like vm_bind should be involved from the
>>   start (or maybe the entire thing reused in amdgpu, they're looking at
>>   vk sparse binding support too or at least have perf issues I think).
>>
>> - Locking must be the same across all implemntations, otherwise it's
>>   really not an abstract. i915 screwed this up terribly by having
>>   different locking rules for ppgtt and ggtt, which is just nonsense.
>>
>> - The legacy specific code needs to be extracted as much as possible and
>>   shoved into separate files. In execbuf this means we need to get back to
>>   object centric flow, and the slowpaths need to become a lot simpler
>>   again (Maarten has cleaned up some of this, but there's still a silly
>>   amount of hacks in there with funny layering).
>>
>> - I think if stuff like the vma eviction details (list movement and
>>   locking and refcounting of the underlying object)
>>
>> > +
>> > +These can be worked upon after intitial vm_bind support is added.
>>
>> I don't think that works, given how badly i915-gem team screwed up in
>> other places. And those places had to be fixed by adopting shared code
>> like ttm. Plus there's already a huge unfulffiled promise pending with the
>> drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
>>
>> Cheers, Daniel
>>
>> > +
>> > +
>> > +UAPI
>> > +=====
>> > +Uapi definiton can be found here:
>> > +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
>> > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
>> > index 91e93a705230..7d10c36b268d 100644
>> > --- a/Documentation/gpu/rfc/index.rst
>> > +++ b/Documentation/gpu/rfc/index.rst
>> > @@ -23,3 +23,7 @@ host such documentation:
>> >  .. toctree::
>> >
>> >      i915_scheduler.rst
>> > +
>> > +.. toctree::
>> > +
>> > +    i915_vm_bind.rst
>> > --
>> > 2.21.0.rc0.32.g243a4c7e27
>> >
>>
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
>
>
>
>-- 
>Daniel Vetter
>Software Engineer, Intel Corporation
>http://blog.ffwll.ch