[PATCH v6 2/8] drm/ttm: Add ttm_bo_access
Christian König
christian.koenig at amd.com
Wed Nov 13 08:37:04 UTC 2024
On 12.11.24 at 17:33, Thomas Hellström wrote:
> [SNIP]
>>>> This has been extensively discussed already, but was expected to
>>>> really only be needed for low-on-memory scenarios. However it now
>>>> seems like the need is much earlier due to the random userptr page
>>>> joining by core mm.
>>> Just to clarify here:
>>>
>>> In Long-Running mode with recoverable pagefaults enabled we don't
>>> have any preempt-fences, but rather just zap the PTEs pointing to
>>> the affected memory and flush the TLB. So from a memory resource
>>> PoV a breakpoint should be safe, and no mmu notifier nor shrinker
>>> will be blocked.
>> That sounds like an HMM-based approach, which would clearly work.
>>
>> But where is that? I don't see any HMM-based approach anywhere in
>> the Xe driver.
> This is a mode that uses recoverable pagefaults to fault in either
> full userptrs or full bos, and is used with
> DRM_XE_VM_CREATE_FLAG_FAULT_MODE (not SVM)!
>
> userptrs in xe are bo-less and use the vm's resv, but otherwise use
> hmm similarly to amdgpu: xe_hmm.c
Yeah, I have seen that one.
> fault servicing:
> xe_gt_pagefault.c
>
> PTE zapping on eviction and notifier:
> xe_vm_invalidate_vma(), xe_vm.c
Ah, that was the stuff I was missing.
So the implementation in xe_preempt_fence.c is just for graphics
submissions? That would make the whole thing much easier to handle.
The only remaining question I can then see is whether long-running
submissions with DRM_XE_VM_CREATE_FLAG_FAULT_MODE could potentially
block graphics submissions without this flag from accessing the
hardware?
Thanks a lot for pointing this out,
Christian.
>
> Thanks,
> Thomas
>
>> Regards,
>> Christian.
>>
>>> Nor will there be any jobs with published dma-fences depending on
>>> the job blocked, either temporarily by a pagefault or long-term by
>>> a debugger breakpoint.
>>>
>>> /Thomas
>>>
>>>
>>>> If that is done and the memory pre-empt fence is serviced even for
>>>> debuggable contexts, do you have further concerns with the
>>>> presented approach from a dma-buf and drm/sched perspective?
>>>>
>>>> Regards, Joonas
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>> This means that a breakpoint or core dump doesn't halt GPU
>>>>> threads, but rather suspends them. E.g. all running wave data is
>>>>> collected into a state bag which can be restored later on.
>>>>>
>>>>> I was under the impression that those long-running compute
>>>>> threads do exactly that, but when the hardware can't switch out
>>>>> the GPU thread/process while in a break, then that isn't the
>>>>> case.
>>>>>
>>>>> As long as you don't find a way to avoid that, this patch set is
>>>>> a pretty clear NAK from my side as DMA-buf and TTM maintainer.
>>>>>
>>>>>
>>>>> I believe this is addressed above.
>>>>>
>>>>> Matt
>>>>>
>>>>>
>>>>> What might work is to keep the submission on the hardware in the
>>>>> break state but forbid any memory access. This way you can signal
>>>>> your preemption fence even when the hardware isn't made
>>>>> available.
>>>>>
>>>>> Before you continue, Xe sets up a new pre-emption fence and makes
>>>>> sure that all page tables etc. are up to date.
>>>>>
>>>>> Could be tricky to get this right if completion-fence-based
>>>>> submissions are mixed in as well, but that gives you at least a
>>>>> direction you could potentially go.
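>>>>>
>>>>> A rough sketch of that flow (all function and field names below
>>>>> are made up for illustration; this is not existing Xe code):
>>>>>
>>>>>   /* On a preemption request while a queue sits at a breakpoint:
>>>>>    * block further memory access from the EUs instead of removing
>>>>>    * the queue from the hardware, then signal the preemption fence
>>>>>    * so eviction can make progress.
>>>>>    */
>>>>>   static void queue_suspend_in_break(struct xe_exec_queue *q)
>>>>>   {
>>>>>           queue_block_memory_access(q);       /* hypothetical */
>>>>>           dma_fence_signal(q->preempt_fence); /* field name assumed */
>>>>>   }
>>>>>
>>>>>   /* Before the queue continues: revalidate the page tables,
>>>>>    * install a fresh preemption fence, then re-enable EU memory
>>>>>    * access.
>>>>>    */
>>>>>   static int queue_resume_from_break(struct xe_exec_queue *q)
>>>>>   {
>>>>>           int err = vm_revalidate(q->vm);     /* hypothetical */
>>>>>
>>>>>           if (err)
>>>>>                   return err;
>>>>>           q->preempt_fence = preempt_fence_create(q); /* hypothetical */
>>>>>           queue_unblock_memory_access(q);     /* hypothetical */
>>>>>           return 0;
>>>>>   }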
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>> Regards, Joonas
>>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>> Some wash-up thoughts from me below, but consider them fairly
>>>>> irrelevant, since I think the main driver for these big questions
>>>>> here should be gdb/userspace.
>>>>>
>>>>>
>>>>> Quoting Christian König (2024-11-07 11:44:33)
>>>>>
>>>>> On 06.11.24 at 18:00, Matthew Brost wrote:
>>>>>
>>>>> [SNIP]
>>>>>
>>>>> This is not a generic interface that anyone can freely access.
>>>>> The same permissions used by ptrace are checked when opening such
>>>>> an interface. See [1] [2].
>>>>>
>>>>>
>>>>> [1] https://patchwork.freedesktop.org/patch/617470/?series=136572&rev=2
>>>>>
>>>>> [2] https://patchwork.freedesktop.org/patch/617471/?series=136572&rev=2
>>>>>
>>>>>
>>>>> Thanks a lot for those pointers, that is exactly what I was
>>>>> looking for.
>>>>>
>>>>> And yeah, it is what I feared. You are re-implementing existing
>>>>> functionality, but see below.
>>>>>
>>>>> Could you elaborate on what this "existing functionality" exactly
>>>>> is? I do not think this functionality exists at this time.
>>>>>
>>>>> The EU debugging architecture for Xe specifically avoids the need
>>>>> for GDB to attach with ptrace to the CPU process, or to interfere
>>>>> with the CPU process for debugging via parasitic threads or the
>>>>> like.
>>>>>
>>>>> A debugger connection is opened to the DRM driver for a given PID
>>>>> (which uses the ptrace may-access check for now), after which all
>>>>> DRM clients of that PID are exposed to the debugger process.
>>>>>
>>>>> What we want to expose via that debugger connection is the
>>>>> ability for GDB to read/write the different GPU VM address spaces
>>>>> (ppGTT for Intel GPUs) just like the EU threads would see them.
>>>>> Note that the layout of the ppGTT is completely up to the
>>>>> userspace driver to set up, and mostly only partially matches the
>>>>> CPU address space.
>>>>>
>>>>> Specifically, as part of reading/writing the ppGTT for debugging
>>>>> purposes, deep flushes are needed: for example, flushing the
>>>>> instruction cache when adding/removing breakpoints.
>>>>>
>>>>> Maybe that will explain the background. I elaborate on this some
>>>>> more at the end.
>>>>>
>>>>>
>>>>> kmap/vmap are used everywhere in the DRM subsystem to access BOs,
>>>>> so I’m failing to see the problem with adding a simple helper
>>>>> based on existing code.
>>>>>
>>>>> What's possible and often done is to do kmap/vmap if you need to
>>>>> implement a CPU copy for scanout, for example, or for
>>>>> copying/validating command buffers. But that usually requires
>>>>> accessing the whole BO and has separate security checks.
>>>>>
>>>>> When you want to access only a few bytes of a BO, that sounds
>>>>> massively like a peek/poke-like interface, and we have already
>>>>> rejected that more than once. There even used to be standardized
>>>>> GEM IOCTLs for that, which have been removed by now.
>>>>>
>>>>> Referring to the explanation at the top: these IOCTLs are not for
>>>>> the debugging target process to issue. The peek/poke interface is
>>>>> specifically for GDB only, to facilitate the emulation of memory
>>>>> reads/writes on the GPU address space as if they were done by the
>>>>> EUs themselves. And to recap: for modifying instructions, for
>>>>> example (add/remove breakpoint), an extra level of cache flushing
>>>>> is needed which is not available to regular userspace.
>>>>>
>>>>> I specifically discussed the difference with Sima before moving
>>>>> forward with this design originally. If something has changed
>>>>> since then, I'm of course happy to rediscuss.
>>>>>
>>>>> However, if this code can't be added, I'm not sure how we would
>>>>> ever be able to implement core dumps for GPU threads/memory?
>>>>>
>>>>>
>>>>> If you need to access BOs which are placed in memory that is not
>>>>> CPU accessible, then implement the access callback for ptrace;
>>>>> see amdgpu_ttm_access_memory for an example of how to do this.
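>>>>>
>>>>> The wiring is roughly the following (a simplified sketch; a
>>>>> driver's actual ops table differs in details such as its fault
>>>>> handler):
>>>>>
>>>>>   /* ptrace's peek/poke path falls back to the VMA's .access
>>>>>    * callback when normal uaccess fails, so a TTM driver hooks BO
>>>>>    * access for ptrace via its vm_operations_struct.
>>>>>    */
>>>>>   static const struct vm_operations_struct drv_gem_vm_ops = {
>>>>>           .fault  = ttm_bo_vm_fault,
>>>>>           .open   = ttm_bo_vm_open,
>>>>>           .close  = ttm_bo_vm_close,
>>>>>           .access = ttm_bo_vm_access, /* ends up in the driver's
>>>>>                                          access_memory callback
>>>>>                                          for non-mappable
>>>>>                                          placements */
>>>>>   };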
>>>>>
>>>>> As also mentioned above, we don't work via ptrace at all when it
>>>>> comes to debugging the EUs. The only thing used for now is
>>>>> ptrace_may_access, to implement access restrictions similar to
>>>>> those of ptrace. This can be changed to something else if needed.
>>>>>
>>>>>
>>>>> Ptrace access via vm_operations_struct.access → ttm_bo_vm_access.
>>>>>
>>>>> This series renames ttm_bo_vm_access to ttm_bo_access, with no
>>>>> code changes.
>>>>>
>>>>> The above function accesses a BO via kmap if it is in SYSTEM /
>>>>> TT, which is existing code.
>>>>>
>>>>> This function is only exposed to user space via ptrace
>>>>> permissions.
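>>>>>
>>>>> The SYSTEM / TT path is essentially the following (a simplified
>>>>> sketch of the existing kmap helper, not a verbatim copy):
>>>>>
>>>>>   /* Map one page at a time with ttm_bo_kmap and memcpy through
>>>>>    * the kernel mapping; 'write' selects poke vs. peek. SYSTEM/TT
>>>>>    * pages are never iomem, so a plain memcpy is fine here.
>>>>>    */
>>>>>   static int bo_access_kmap(struct ttm_buffer_object *bo,
>>>>>                             unsigned long offset,
>>>>>                             void *buf, int len, int write)
>>>>>   {
>>>>>           unsigned long page = offset >> PAGE_SHIFT;
>>>>>           unsigned long bytes_left = len;
>>>>>           int ret;
>>>>>
>>>>>           do {
>>>>>                   unsigned long off = offset & ~PAGE_MASK;
>>>>>                   unsigned long bytes = min(bytes_left,
>>>>>                                             PAGE_SIZE - off);
>>>>>                   struct ttm_bo_kmap_obj map;
>>>>>                   bool is_iomem;
>>>>>                   void *ptr;
>>>>>
>>>>>                   ret = ttm_bo_kmap(bo, page, 1, &map);
>>>>>                   if (ret)
>>>>>                           return ret;
>>>>>
>>>>>                   ptr = ttm_kmap_obj_virtual(&map, &is_iomem) + off;
>>>>>                   if (write)
>>>>>                           memcpy(ptr, buf, bytes);
>>>>>                   else
>>>>>                           memcpy(buf, ptr, bytes);
>>>>>                   ttm_bo_kunmap(&map);
>>>>>
>>>>>                   page++;
>>>>>                   buf += bytes;
>>>>>                   offset += bytes;
>>>>>                   bytes_left -= bytes;
>>>>>           } while (bytes_left);
>>>>>
>>>>>           return len;
>>>>>   }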
>>>>>
>>>>> Maybe this sentence is what caused the confusion.
>>>>>
>>>>> Userspace is never exposed to a peek/poke interface, only the
>>>>> debugger connection, which is its own FD.
>>>>>
>>>>>
>>>>> In this series, we implement a function [3] similar to
>>>>> amdgpu_ttm_access_memory for the TTM vfunc access_memory. What is
>>>>> missing is access to non-CPU-visible memory, similar to
>>>>> amdgpu_ttm_access_memory_sdma. This will be addressed in a
>>>>> follow-up and was omitted in this series given its complexity.
>>>>>
>>>>> So, this looks more or less identical to AMD's ptrace
>>>>> implementation, but in GPU address space. Again, I fail to see
>>>>> what the problem is here. What am I missing?
>>>>>
>>>>>
>>>>> The main question is: why can't you use the existing interfaces
>>>>> directly?
>>>>>
>>>>> We're not working on the CPU address space or BOs. We're working
>>>>> strictly on the GPU address space as it would be seen by an EU
>>>>> thread if it accessed address X.
>>>>>
>>>>>
>>>>> In addition to the peek/poke interface of ptrace, Linux has the
>>>>> pidfd_getfd system call, see here:
>>>>> https://man7.org/linux/man-pages/man2/pidfd_getfd.2.html
>>>>>
>>>>> pidfd_getfd() allows you to dup() the render node file descriptor
>>>>> into your gdb process. That in turn gives you all the access you
>>>>> need from gdb, including mapping BOs and command submission on
>>>>> behalf of the application.
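>>>>>
>>>>> From the debugger side that looks roughly like this (a minimal
>>>>> userspace sketch; the target pid and fd number are assumed to be
>>>>> known, e.g. from /proc/<pid>/fd):
>>>>>
>>>>>   #include <sys/syscall.h>
>>>>>   #include <sys/types.h>
>>>>>   #include <unistd.h>
>>>>>
>>>>>   /* Duplicate the target's render node fd into our process.
>>>>>    * pidfd_getfd() enforces the same PTRACE_MODE_ATTACH_REALCREDS
>>>>>    * check that ptrace attach uses.
>>>>>    */
>>>>>   int dup_render_fd(pid_t target, int target_fd)
>>>>>   {
>>>>>           int pidfd = syscall(SYS_pidfd_open, target, 0);
>>>>>           int fd;
>>>>>
>>>>>           if (pidfd < 0)
>>>>>                   return -1;
>>>>>           fd = syscall(SYS_pidfd_getfd, pidfd, target_fd, 0);
>>>>>           close(pidfd);
>>>>>           return fd;
>>>>>   }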
>>>>>
>>>>> We're not operating on the CPU address space, nor are we
>>>>> operating on BOs (there is no concept of a BO in the EU debug
>>>>> interface). Each VMA in the VM could come from anywhere; only the
>>>>> start address and size matter. And neither do we need to
>>>>> interfere with the command submission of the process under debug.
>>>>>
>>>>>
>>>>> As far as I can see, that allows for the same functionality as
>>>>> the eudebug interface, just without any driver-specific code
>>>>> messing with ptrace permissions and peek/poke interfaces.
>>>>>
>>>>> So the question is still: why do you need the whole eudebug
>>>>> interface in the first place? I might be missing something, but
>>>>> that seems to be superfluous from a high-level view.
>>>>>
>>>>> Recapping from above: it is to allow the debugging of EU threads
>>>>> per DRM client, completely independent of the CPU process. If
>>>>> ptrace_may_access is the sore point, we could consider other
>>>>> permission checks, too. There is no other connection to ptrace in
>>>>> this architecture than the single permission check to know if a
>>>>> PID is fair game for access by the debugger process.
>>>>>
>>>>> Why no parasitic thread or ptrace: going forward, binding the EU
>>>>> debugging to the DRM client would also pave the way for being
>>>>> able to extend the core-kernel-generated core dump with each DRM
>>>>> client's EU thread/memory dump. We have a similar feature called
>>>>> "Offline core dump" enabled in the downstream public trees for
>>>>> i915, where we currently attach the EU thread dump to the i915
>>>>> error state and then later combine the i915 error state with the
>>>>> CPU core dump file using a tool.
>>>>>
>>>>> This is a relatively small amount of extra code, as this baseline
>>>>> series already gives GDB the ability to perform the necessary
>>>>> actions. It's just a matter of the kernel driver calling "stop
>>>>> all threads", then copying the memory map and memory contents for
>>>>> GPU threads, just as is done for CPU threads.
>>>>>
>>>>> With parasitic thread injection, I'm not sure there is such a way
>>>>> forward, as it would seem to require injecting quite a bit more
>>>>> logic into the core kernel?
>>>>>
>>>>>
>>>>> It's true that the AMD KFD part still has similar functionality,
>>>>> but that is because of the broken KFD design of tying driver
>>>>> state to the CPU process (which makes it inaccessible to gdb even
>>>>> with an imported render node fd).
>>>>>
>>>>> Both Sima and I (and partially Dave as well) have pushed back on
>>>>> the KFD approach. And the long-term plan is to get rid of such
>>>>> device-driver-specific interfaces which re-implement existing
>>>>> functionality, just differently.
>>>>>
>>>>> Recapping, this series is not adding it back. The debugger
>>>>> connection is a separate FD from the DRM one, with a separate
>>>>> IOCTL set. We don't allow the DRM FD any new operations based on
>>>>> whether ptrace is attached or not. We never even do that check.
>>>>>
>>>>> We only restrict the opening of the debugger connection to a
>>>>> given PID with a ptrace_may_access check for now. That can be
>>>>> changed to something else, if necessary.
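>>>>>
>>>>> I.e. the open path is gated roughly like this (a sketch; the
>>>>> function name is illustrative, not the actual uapi):
>>>>>
>>>>>   /* Only allow a debugger connection to a PID the caller could
>>>>>    * also ptrace; everything else lives on the separate debug FD.
>>>>>    */
>>>>>   static int eudebug_connect_check(pid_t pid)
>>>>>   {
>>>>>           struct task_struct *task;
>>>>>           bool allowed;
>>>>>
>>>>>           rcu_read_lock();
>>>>>           task = pid_task(find_vpid(pid), PIDTYPE_PID);
>>>>>           if (task)
>>>>>                   get_task_struct(task);
>>>>>           rcu_read_unlock();
>>>>>           if (!task)
>>>>>                   return -ESRCH;
>>>>>
>>>>>           allowed = ptrace_may_access(task,
>>>>>                                       PTRACE_MODE_READ_REALCREDS);
>>>>>           put_task_struct(task);
>>>>>
>>>>>           return allowed ? 0 : -EACCES;
>>>>>   }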
>>>>>
>>>>> Yeah, I think unnecessarily tying gpu processes to cpu processes
>>>>> is a bad thing, not least because even today all the svm
>>>>> discussions we have still hit clear use-cases where a 1:1 match
>>>>> is not wanted (like multiple gpu svm sections with offsets). Not
>>>>> even speaking of all the gpu use-cases where the gpu vm space is
>>>>> still entirely independent of the cpu side.
>>>>>
>>>>> So that's why I think this entirely separate approach looks like
>>>>> the right one, with ptrace_may_access as the access control check
>>>>> to make sure we match ptrace on the cpu side.
>>>>>
>>>>> But there's very obviously a bikeshed to be had on what the
>>>>> actual uapi should look like, especially how gdb opens up a gpu
>>>>> debug access fd. But I also think that's not much for drm to
>>>>> decide, but rather whatever gdb wants. And then we aim for some
>>>>> consistency on that lookup/access control part (ideally; I might
>>>>> be missing some reasons why this is a bad idea) across drm
>>>>> drivers.
>>>>>
>>>>>
>>>>> So you need to have a really, really good explanation of why the
>>>>> eudebug interface is actually necessary.
>>>>>
>>>>> TL;DR: The main point is to decouple the debugging of the EU
>>>>> workloads from the debugging of the CPU process. This avoids
>>>>> interfering with the CPU process via parasitic thread injection.
>>>>> Further, this also allows generating a core dump without any GDB
>>>>> connected. There are also many other smaller pros/cons which can
>>>>> be discussed, but for the context of this patch, this is the main
>>>>> one.
>>>>>
>>>>> So unlike parasitic thread injection, we don't unlock any special
>>>>> IOCTLs for the process under debug to be performed by the
>>>>> parasitic thread; instead we allow a minimal set of operations to
>>>>> be performed by GDB as if they were done on the EUs themselves.
>>>>>
>>>>> One can think of it as a minimal subset of ptrace, but for EU
>>>>> threads, not CPU threads. And thus, building on this, it's
>>>>> possible to extend the kernel-generated core dumps with a
>>>>> DRM-specific extension which would contain the EU thread/memory
>>>>> dump.
>>>>>
>>>>> It might be good to document (in that debugging doc patch,
>>>>> probably) why thread injection is not a great option, and why the
>>>>> tradeoffs for debugging are different than for
>>>>> checkpoint/restore, where with CRIU we landed on doing most of
>>>>> this in userspace, often requiring injection threads to make it
>>>>> all work.
>>>>>
>>>>> Cheers, Sima
>>>>>
>>>>>
>>>>> Regards, Joonas
>>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>>
>>>>> Matt
>>>>>
>>>>>
>>>>> [3] https://patchwork.freedesktop.org/patch/622520/?series=140200&rev=6
>>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>> Matt
>>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>>