[PATCH v6 2/8] drm/ttm: Add ttm_bo_access
Thomas Hellström
thomas.hellstrom at linux.intel.com
Wed Nov 13 10:44:12 UTC 2024
On Wed, 2024-11-13 at 09:37 +0100, Christian König wrote:
> Am 12.11.24 um 17:33 schrieb Thomas Hellström:
> > [SNIP]
> > > > > This has been extensively discussed already, but was expected
> > > > > to
> > > > > really
> > > > > only be needed for low-on-memory scenarios. However it now
> > > > > seems
> > > > > like
> > > > > the need is much earlier due to the random userptr page
> > > > > joining
> > > > > by
> > > > > core
> > > > > mm.
> > > > Just to clarify here:
> > > >
> > > > In Long-Running mode with recoverable pagefaults enabled we
> > > > don't
> > > > have
> > > > any preempt-fences, but rather just zap the PTEs pointing to
> > > > the
> > > > affected memory and flush TLB. So from a memory resource POW a
> > > > breakpoint should be safe, and no mmu notifier nor shrinker
> > > > will be
> > > > blocked.
> > > That sounds like a HMM based approach which would clearly work.
> > >
> > > But where is that? I don't see any HMM based approach anywhere in
> > > the
> > > XE
> > > driver.
> > This is a mode that uses recoverable pagefaults to fault either
> > full
> > userptr or full bos, and used with DRM_XE_VM_CRATE_FLAG_FAULT_MODE.
> > (not SVM)!
> >
> > userptrs in xe are bo-less, and using the vm's resv, but otherwise
> > using hmm similar to amdgpu: xe_hmm.c
>
> Yeah, I have seen that one.
>
> > fault servicing:
> > xe_gt_pagefault.c
> >
> > PTE zapping on eviction and notifier:
> > xe_vm_invalidate_vma(), xe_vm.c
>
> Ah, that was the stuff I was missing.
>
> So the implementation in xe_preempt_fence.c is just for graphics
> submissions? That would make the whole thing much easier to handle.
Actually it's not, it's intended for long-running mode, but as a
consequence the debugger would be allowed only in fault mode.
>
> The only remaining question I can then see is if long running
> submissions with DRM_XE_VM_CRATE_FLAG_FAULT_MODE could potentially
> block
> graphics submissions without this flag from accessing the hardware?
Yes and no. We have a mechanism in place that allows either only fault
mode jobs or non-faulting jobs on the same, what we call "engine
group".
A pagefault on an engine group would block or hamper progress of other
jobs on that engine group.
So let's say a dma-fence job is submitted to an engine group that is
currently running a faulting job. We'd then need to switch mode of the
engine group and, in the exec ioctl we'd (explicitly without preempt-
fences) preempt the faulting job before submitting the dma-fence job
and publishing its fence. This preemption will incur a delay which is
typically the delay of servicing any outstanding pagefaults. It's not
ideal, but the best we can do, and it doesn't affect core memory
management nor does it affect migration blits.
In the debugger case, this delay could be long due to breakpoints, and
that's why enabling the debugger would sit behind a flag and not
something default (I think this was discussed earlier in the thread).
Still, core memory management would be unaffected, and also ofc the
migration blits are completely independent.
/Thomas
>
> Thanks a lot for pointing this out,
> Christian.
>
> >
> > Thanks,
> > Thomas
> >
> > > Regards,
> > > Christian.
> > >
> > > > Nor will there be any jobs with published dma-fences depending
> > > > on
> > > > the
> > > > job blocked either temporarily by a pagefault or long-term by a
> > > > debugger breakpoint.
> > > >
> > > > /Thomas
> > > >
> > > >
> > > > > If that is done and the memory pre-empt fence is serviced
> > > > > even
> > > > > for
> > > > > debuggable contexts, do you have further concerns with the
> > > > > presented
> > > > > approach
> > > > > from dma-buf and drm/sched perspective?
> > > > >
> > > > > Regards, Joonas
> > > > >
> > > > > > Regards,
> > > > > > Christian.
> > > > > >
> > > > > >
> > > > > > This means that a breakpoint or core dump doesn't
> > > > > > halt
> > > > > > GPU
> > > > > > threads, but
> > > > > > rather suspends them. E.g. all running wave data
> > > > > > is
> > > > > > collected into a state
> > > > > > bag which can be restored later on.
> > > > > >
> > > > > > I was under the impression that those long
> > > > > > running
> > > > > > compute
> > > > > > threads do
> > > > > > exactly that, but when the hardware can't switch
> > > > > > out
> > > > > > the
> > > > > > GPU thread/process
> > > > > > while in a break then that isn't the case.
> > > > > >
> > > > > > As long as you don't find a way to avoid that
> > > > > > this
> > > > > > patch
> > > > > > set is a pretty
> > > > > > clear NAK from my side as DMA-buf and TTM
> > > > > > maintainer.
> > > > > >
> > > > > >
> > > > > > I believe this is addressed above.
> > > > > >
> > > > > > Matt
> > > > > >
> > > > > >
> > > > > > What might work is to keep the submission on the
> > > > > > hardware
> > > > > > in the break state
> > > > > > but forbid any memory access. This way you can
> > > > > > signal
> > > > > > your
> > > > > > preemption fence
> > > > > > even when the hardware isn't made available.
> > > > > >
> > > > > > Before you continue XE setups a new pre-emption
> > > > > > fence
> > > > > > and
> > > > > > makes sure that
> > > > > > all page tables etc... are up to date.
> > > > > >
> > > > > > Could be tricky to get this right if completion
> > > > > > fence
> > > > > > based
> > > > > > submissions are
> > > > > > mixed in as well, but that gives you at least a
> > > > > > direction
> > > > > > you could
> > > > > > potentially go.
> > > > > >
> > > > > > Regards,
> > > > > > Christian.
> > > > > >
> > > > > >
> > > > > > Regards, Joonas
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Christian.
> > > > > >
> > > > > >
> > > > > > Some wash-up thoughts from me below,
> > > > > > but
> > > > > > consider them fairly irrelevant
> > > > > > since I think the main driver for
> > > > > > these
> > > > > > big
> > > > > > questions here should be
> > > > > > gdb/userspace.
> > > > > >
> > > > > >
> > > > > > Quoting Christian König (2024-11-
> > > > > > 07
> > > > > > 11:44:33)
> > > > > >
> > > > > > Am 06.11.24 um 18:00 schrieb
> > > > > > Matthew
> > > > > > Brost:
> > > > > >
> > > > > > [SNIP]
> > > > > >
> > > > > > This is not a generic
> > > > > > interface
> > > > > > that anyone can freely access. The same
> > > > > > permissions used by
> > > > > > ptrace
> > > > > > are
> > > > > > checked when opening such an interface.
> > > > > > See [1] [2].
> > > > > >
> > > > > >
> > > > > > [1]
> > > > > > https://patchwork.freedesktop.org/patch/617470/?series=136572&r
> > > > > > e
> > > > > > v=2
> > > > > >
> > > > > > [2]
> > > > > > https://patchwork.freedesktop.org/patch/617471/?series=136572&r
> > > > > > e
> > > > > > v=2
> > > > > >
> > > > > >
> > > > > > Thanks a lot for those
> > > > > > pointers,
> > > > > > that
> > > > > > is exactly what I was looking for.
> > > > > >
> > > > > > And yeah, it is what I
> > > > > > feared. You
> > > > > > are
> > > > > > re-implementing existing functionality,
> > > > > > but see below.
> > > > > >
> > > > > > Could you elaborate on what this
> > > > > > "existing
> > > > > > functionality" exactly is?
> > > > > > I do not think this functionality
> > > > > > exists at
> > > > > > this time.
> > > > > >
> > > > > > The EU debugging architecture for
> > > > > > Xe
> > > > > > specifically avoids the need for GDB
> > > > > > to attach with ptrace to the CPU
> > > > > > process or
> > > > > > interfere with the CPU process for
> > > > > > the debugging via parasitic
> > > > > > threads or
> > > > > > so.
> > > > > >
> > > > > > Debugger connection is opened to
> > > > > > the
> > > > > > DRM
> > > > > > driver for given PID (which uses the
> > > > > > ptrace may access check for now)
> > > > > > after
> > > > > > which the all DRM client of that
> > > > > > PID are exposed to the debugger
> > > > > > process.
> > > > > >
> > > > > > What we want to expose via that
> > > > > > debugger
> > > > > > connection is the ability for GDB to
> > > > > > read/write the different GPU VM
> > > > > > address
> > > > > > spaces (ppGTT for Intel GPUs) just like
> > > > > > the EU threads would see them.
> > > > > > Note
> > > > > > that
> > > > > > the layout of the ppGTT is
> > > > > > completely up to the userspace
> > > > > > driver
> > > > > > to
> > > > > > setup and is mostly only partially
> > > > > > equal to the CPU address space.
> > > > > >
> > > > > > Specifically as part of
> > > > > > reading/writing the
> > > > > > ppGTT for debugging purposes,
> > > > > > there are deep flushes needed:
> > > > > > for
> > > > > > example
> > > > > > flushing instruction cache
> > > > > > when adding/removing breakpoints.
> > > > > >
> > > > > > Maybe that will explain the
> > > > > > background. I
> > > > > > elaborate on this at the end some more.
> > > > > >
> > > > > >
> > > > > > kmap/vmap are
> > > > > > used
> > > > > > everywhere in the DRM subsystem to access BOs, so I’m
> > > > > > failing to see
> > > > > > the
> > > > > > problem with adding a simple helper based on existing
> > > > > > code.
> > > > > >
> > > > > > What#s possible and
> > > > > > often
> > > > > > done is to do kmap/vmap if you need to implement a
> > > > > > CPU copy for
> > > > > > scanout for
> > > > > > example or for copying/validating command buffers.
> > > > > > But that usually
> > > > > > requires
> > > > > > accessing the whole BO and has separate security
> > > > > > checks.
> > > > > >
> > > > > > When you want to
> > > > > > access
> > > > > > only
> > > > > > a few bytes of a BO that sounds massively like
> > > > > > a peek/poke like
> > > > > > interface
> > > > > > and we have already rejected that more than once.
> > > > > > There even used to
> > > > > > be
> > > > > > standardized GEM IOCTLs for that which have been
> > > > > > removed by now.
> > > > > >
> > > > > > Referring to the explanation at
> > > > > > top:
> > > > > > These
> > > > > > IOCTL are not for the debugging target
> > > > > > process to issue. The peek/poke
> > > > > > interface
> > > > > > is specifically for GDB only
> > > > > > to facilitate the emulation of
> > > > > > memory
> > > > > > reads/writes on the GPU address
> > > > > > space as they were done by EUs
> > > > > > themselves.
> > > > > > And to recap: for modifying
> > > > > > instructions for example
> > > > > > (add/remove
> > > > > > breakpoint), extra level of cache flushing is
> > > > > > needed which is not available to
> > > > > > regular
> > > > > > userspace.
> > > > > >
> > > > > > I specifically discussed with
> > > > > > Sima on
> > > > > > the
> > > > > > difference before moving forward with this
> > > > > > design originally. If something
> > > > > > has
> > > > > > changed
> > > > > > since then, I'm of course happy to rediscuss.
> > > > > >
> > > > > > However, if this code can't be
> > > > > > added,
> > > > > > not
> > > > > > sure how we would ever be able
> > > > > > to implement core dumps for GPU
> > > > > > threads/memory?
> > > > > >
> > > > > >
> > > > > > If you need to
> > > > > > access
> > > > > > BOs
> > > > > > which are placed in not CPU accessible memory then
> > > > > > implement the
> > > > > > access
> > > > > > callback
> > > > > > for ptrace, see amdgpu_ttm_access_memory for
> > > > > > an example how to
> > > > > > do
> > > > > > this.
> > > > > >
> > > > > > As also mentioned above, we don't
> > > > > > work
> > > > > > via
> > > > > > ptrace at all when it comes
> > > > > > to debugging the EUs. The only
> > > > > > thing
> > > > > > used
> > > > > > for now is the ptrace_may_access to
> > > > > > implement similar access
> > > > > > restrictions
> > > > > > as
> > > > > > ptrace has. This can be changed
> > > > > > to something else if needed.
> > > > > >
> > > > > >
> > > > > > Ptrace access via
> > > > > > vm_operations_struct.access → ttm_bo_vm_access.
> > > > > >
> > > > > > This series renames
> > > > > > ttm_bo_vm_access to ttm_bo_access, with no code changes.
> > > > > >
> > > > > > The above function
> > > > > > accesses
> > > > > > a BO
> > > > > > via kmap if it is in SYSTEM / TT,
> > > > > > which is existing code.
> > > > > >
> > > > > > This function is only
> > > > > > exposed to
> > > > > > user space via ptrace permissions.
> > > > > >
> > > > > > Maybe this sentence is what
> > > > > > caused the
> > > > > > confusion.
> > > > > >
> > > > > > Userspace is never exposed with
> > > > > > peek/poke
> > > > > > interface, only the debugger
> > > > > > connection which is its own FD.
> > > > > >
> > > > > >
> > > > > > In this series, we
> > > > > > implement
> > > > > > a
> > > > > > function [3] similar to
> > > > > >
> > > > > > amdgpu_ttm_access_memory for
> > > > > > the
> > > > > > TTM vfunc access_memory. What is
> > > > > > missing is non-visible
> > > > > > CPU
> > > > > > memory
> > > > > > access, similar to
> > > > > >
> > > > > > amdgpu_ttm_access_memory_sdma.
> > > > > > This will be addressed in a follow-up and
> > > > > > was omitted in this
> > > > > > series
> > > > > > given
> > > > > > its complexity.
> > > > > >
> > > > > > So, this looks more or
> > > > > > less
> > > > > > identical to AMD's ptrace implementation,
> > > > > > but in GPU address
> > > > > > space.
> > > > > > Again,
> > > > > > I fail to see what the problem is here.
> > > > > > What am I missing?
> > > > > >
> > > > > >
> > > > > > The main question is why
> > > > > > can't you
> > > > > > use
> > > > > > the existing interfaces directly?
> > > > > >
> > > > > > We're not working on the CPU
> > > > > > address
> > > > > > space
> > > > > > or BOs. We're working
> > > > > > strictly on the GPU address space
> > > > > > as
> > > > > > would
> > > > > > be seen by an EU thread if it
> > > > > > accessed address X.
> > > > > >
> > > > > >
> > > > > > Additional to the peek/poke
> > > > > > interface
> > > > > > of ptrace Linux has the pidfd_getfd
> > > > > > system call, see
> > > > > > here
> > > > > > https://man7.org/linux/man-pages/man2/pidfd_getfd.2.html.
> > > > > >
> > > > > > The pidfd_getfd() allows to
> > > > > > dup()
> > > > > > the
> > > > > > render node file descriptor into your gdb
> > > > > > process. That in turn gives
> > > > > > you
> > > > > > all the
> > > > > > access you need from gdb, including
> > > > > > mapping BOs and command
> > > > > > submission
> > > > > > on
> > > > > > behalf of the application.
> > > > > >
> > > > > > We're not operating on the CPU
> > > > > > address
> > > > > > space nor are we operating on BOs
> > > > > > (there is no concept of BO in the
> > > > > > EU
> > > > > > debug
> > > > > > interface). Each VMA in the VM
> > > > > > could come from anywhere, only
> > > > > > the
> > > > > > start
> > > > > > address and size matter. And
> > > > > > neither do we need to interfere
> > > > > > with
> > > > > > the
> > > > > > command submission of the
> > > > > > process under debug.
> > > > > >
> > > > > >
> > > > > > As far as I can see that
> > > > > > allows
> > > > > > for the
> > > > > > same functionality as the eudebug
> > > > > > interface, just without any
> > > > > > driver
> > > > > > specific code messing with ptrace
> > > > > > permissions and peek/poke
> > > > > > interfaces.
> > > > > >
> > > > > > So the question is still why
> > > > > > do
> > > > > > you
> > > > > > need the whole eudebug interface in the
> > > > > > first place? I might be
> > > > > > missing
> > > > > > something, but that seems to be superfluous
> > > > > > from a high level view.
> > > > > >
> > > > > > Recapping from above. It is to
> > > > > > allow
> > > > > > the
> > > > > > debugging of EU threads per DRM
> > > > > > client, completely independent of
> > > > > > the
> > > > > > CPU
> > > > > > process. If ptrace_may_acces
> > > > > > is the sore point, we could
> > > > > > consider
> > > > > > other
> > > > > > permission checks, too. There
> > > > > > is no other connection to ptrace
> > > > > > in
> > > > > > this
> > > > > > architecture as single
> > > > > > permission check to know if PID
> > > > > > is
> > > > > > fair
> > > > > > game to access by debugger
> > > > > > process.
> > > > > >
> > > > > > Why no parasitic thread or
> > > > > > ptrace:
> > > > > > Going
> > > > > > forward, binding the EU debugging to
> > > > > > the DRM client would also pave
> > > > > > way for
> > > > > > being able to extend core kernel generated
> > > > > > core dump with each DRM client's
> > > > > > EU
> > > > > > thread/memory dump. We have similar
> > > > > > feature called "Offline core
> > > > > > dump"
> > > > > > enabled
> > > > > > in the downstream public
> > > > > > trees for i915, where we
> > > > > > currently
> > > > > > attach
> > > > > > the EU thread dump to i915 error state
> > > > > > and then later combine i915 error
> > > > > > state
> > > > > > with CPU core dump file with a
> > > > > > tool.
> > > > > >
> > > > > > This is relatively little amount
> > > > > > of
> > > > > > extra
> > > > > > code, as this baseline series
> > > > > > already introduces GDB the
> > > > > > ability to
> > > > > > perform the necessary actions.
> > > > > > It's just the matter of kernel
> > > > > > driver
> > > > > > calling: "stop all threads", then
> > > > > > copying the memory map and memory
> > > > > > contents
> > > > > > for GPU threads, just like is
> > > > > > done for CPU threads.
> > > > > >
> > > > > > With parasitic thread injection,
> > > > > > not
> > > > > > sure
> > > > > > if there is such way forward,
> > > > > > as it would seem to require to
> > > > > > inject
> > > > > > quite
> > > > > > abit more logic to core kernel?
> > > > > >
> > > > > >
> > > > > > It's true that the AMD KFD
> > > > > > part
> > > > > > has
> > > > > > still similar functionality, but that is
> > > > > > because of the broken KFD
> > > > > > design
> > > > > > of
> > > > > > tying driver state to the CPU process
> > > > > > (which makes it inaccessible
> > > > > > for
> > > > > > gdb
> > > > > > even with imported render node fd).
> > > > > >
> > > > > > Both Sima and I (and
> > > > > > partially
> > > > > > Dave as
> > > > > > well) have pushed back on the KFD
> > > > > > approach. And the long term
> > > > > > plan
> > > > > > is to
> > > > > > get rid of such device driver specific
> > > > > > interface which re-implement
> > > > > > existing
> > > > > > functionality just differently.
> > > > > >
> > > > > > Recapping, this series is not
> > > > > > adding
> > > > > > it
> > > > > > back. The debugger connection
> > > > > > is a separate FD from the DRM
> > > > > > one,
> > > > > > with
> > > > > > separate IOCTL set. We don't allow
> > > > > > the DRM FD any new operations
> > > > > > based on
> > > > > > ptrace is attached or not. We
> > > > > > don't ever do that check even.
> > > > > >
> > > > > > We only restrict the opening of
> > > > > > the
> > > > > > debugger connection to given PID with
> > > > > > ptrace_may_access check for now.
> > > > > > That
> > > > > > can
> > > > > > be changed to something else,
> > > > > > if necessary.
> > > > > >
> > > > > > Yeah I think unnecessarily tying gpu
> > > > > > processes
> > > > > > to cpu processes is a bad
> > > > > > thing, least because even today all
> > > > > > the
> > > > > > svm
> > > > > > discussions we have still hit
> > > > > > clear use-cases, where a 1:1 match is
> > > > > > not
> > > > > > wanted (like multiple gpu svm
> > > > > > sections with offsets). Not even
> > > > > > speaking
> > > > > > of
> > > > > > all the gpu usecases where
> > > > > > the gpu vm space is still entirely
> > > > > > independent
> > > > > > of the cpu side.
> > > > > >
> > > > > > So that's why I think this entirely
> > > > > > separate
> > > > > > approach looks like the right
> > > > > > one, with ptrace_may_access as the
> > > > > > access
> > > > > > control check to make sure we
> > > > > > match ptrace on the cpu side.
> > > > > >
> > > > > > But there's very obviously a bikeshed
> > > > > > to
> > > > > > be had
> > > > > > on what the actual uapi
> > > > > > should look like, especially how gdb
> > > > > > opens
> > > > > > up a
> > > > > > gpu debug access fd. But I
> > > > > > also think that's not much on drm to
> > > > > > decide,
> > > > > > but whatever gdb wants. And
> > > > > > then we aim for some consistency on
> > > > > > that
> > > > > > lookup/access control part
> > > > > > (ideally, I might be missing some
> > > > > > reasons
> > > > > > why
> > > > > > this is a bad idea) across
> > > > > > drm drivers.
> > > > > >
> > > > > >
> > > > > > So you need to have a really
> > > > > > really
> > > > > > good explanation why the eudebug interface
> > > > > > is actually necessary.
> > > > > >
> > > > > > TL;DR The main point is to
> > > > > > decouple
> > > > > > the
> > > > > > debugging of the EU workloads from the
> > > > > > debugging of the CPU process.
> > > > > > This
> > > > > > avoids
> > > > > > the interference with the CPU process with
> > > > > > parasitic thread injection.
> > > > > > Further
> > > > > > this
> > > > > > also allows generating a core dump
> > > > > > without any GDB connected. There
> > > > > > are
> > > > > > also
> > > > > > many other smaller pros/cons
> > > > > > which can be discussed but for
> > > > > > the
> > > > > > context
> > > > > > of this patch, this is the
> > > > > > main one.
> > > > > >
> > > > > > So unlike parasitic thread
> > > > > > injection,
> > > > > > we
> > > > > > don't unlock any special IOCTL for
> > > > > > the process under debug to be
> > > > > > performed by
> > > > > > the parasitic thread, but we
> > > > > > allow the minimal set of
> > > > > > operations to
> > > > > > be
> > > > > > performed by GDB as if those were
> > > > > > done on the EUs themselves.
> > > > > >
> > > > > > One can think of it like the
> > > > > > minimal
> > > > > > subset
> > > > > > of ptrace but for EU threads,
> > > > > > not the CPU threads. And thus,
> > > > > > building on
> > > > > > this it's possible to extend
> > > > > > the core kernel generated core
> > > > > > dumps
> > > > > > with
> > > > > > DRM specific extension which
> > > > > > would contain the EU
> > > > > > thread/memory
> > > > > > dump.
> > > > > >
> > > > > > It might be good to document (in that
> > > > > > debugging
> > > > > > doc patch probably) why
> > > > > > thread injection is not a great
> > > > > > option,
> > > > > > and why
> > > > > > the tradeoffs for
> > > > > > debugging are different than for for
> > > > > > checkpoint/restore, where with CRIU
> > > > > > we landed on doing most of this in
> > > > > > userspace,
> > > > > > and often requiring
> > > > > > injection threads to make it all
> > > > > > work.
> > > > > >
> > > > > > Cheers, Sima
> > > > > >
> > > > > >
> > > > > > Regards, Joonas
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Christian.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Matt
> > > > > >
> > > > > >
> > > > > > [3]
> > > > > > https://patchwork.freedesktop.org/patch/622520/?series=140200&r
> > > > > > e
> > > > > > v=6
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Christian.
> > > > > >
> > > > > >
> > > > > > Matt
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Christian.
> > > > > >
> > > > > >
> > > > > >
More information about the dri-devel
mailing list