[PATCH v6 2/8] drm/ttm: Add ttm_bo_access
Christian König
christian.koenig at amd.com
Wed Nov 13 08:37:04 UTC 2024
On 12.11.24 at 17:33, Thomas Hellström wrote:
> [SNIP]
>>>> This has been extensively discussed already, but was expected to
>>>> really only be needed for low-on-memory scenarios. However it now
>>>> seems like the need is much earlier due to the random userptr page
>>>> joining by core mm.
>>> Just to clarify here:
>>>
>>> In Long-Running mode with recoverable pagefaults enabled we don't
>>> have any preempt-fences, but rather just zap the PTEs pointing to
>>> the affected memory and flush the TLB. So from a memory resource
>>> PoV a breakpoint should be safe, and no mmu notifier nor shrinker
>>> will be blocked.
>> That sounds like an HMM-based approach, which would clearly work.
>>
>> But where is that? I don't see any HMM-based approach anywhere in
>> the Xe driver.
> This is a mode that uses recoverable pagefaults to fault in either
> full userptrs or full bos, and is used with
> DRM_XE_VM_CREATE_FLAG_FAULT_MODE (not SVM)!
>
> userptrs in xe are bo-less and use the vm's resv, but otherwise use
> hmm similarly to amdgpu: xe_hmm.c
Yeah, I have seen that one.
> fault servicing:
> xe_gt_pagefault.c
>
> PTE zapping on eviction and notifier:
> xe_vm_invalidate_vma(), xe_vm.c
Ah, that was the stuff I was missing.
So the implementation in xe_preempt_fence.c is just for graphics
submissions? That would make the whole thing much easier to handle.
The only remaining question I can then see is whether long-running
submissions with DRM_XE_VM_CREATE_FLAG_FAULT_MODE could potentially
block graphics submissions without this flag from accessing the
hardware?
Thanks a lot for pointing this out,
Christian.
>
> Thanks,
> Thomas
>
>> Regards,
>> Christian.
>>
>>> Nor will there be any jobs with published dma-fences depending on
>>> the job blocked, either temporarily by a pagefault or long-term by
>>> a debugger breakpoint.
>>>
>>> /Thomas
>>>
>>>
>>>> If that is done and the memory pre-empt fence is serviced even for
>>>> debuggable contexts, do you have further concerns with the
>>>> presented approach from a dma-buf and drm/sched perspective?
>>>>
>>>> Regards, Joonas
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>> This means that a breakpoint or core dump doesn't halt GPU
>>>>> threads, but rather suspends them. E.g. all running wave data is
>>>>> collected into a state bag which can be restored later on.
>>>>>
>>>>> I was under the impression that those long-running compute
>>>>> threads do exactly that, but when the hardware can't switch out
>>>>> the GPU thread/process while in a break, then that isn't the
>>>>> case.
>>>>>
>>>>> As long as you don't find a way to avoid that, this patch set is
>>>>> a pretty clear NAK from my side as DMA-buf and TTM maintainer.
>>>>>
>>>>>
>>>>> I believe this is addressed above.
>>>>>
>>>>> Matt
>>>>>
>>>>>
>>>>> What might work is to keep the submission on the hardware in the
>>>>> break state but forbid any memory access. This way you can signal
>>>>> your preemption fence even when the hardware isn't made
>>>>> available.
>>>>>
>>>>> Before you continue, Xe sets up a new pre-emption fence and makes
>>>>> sure that all page tables etc. are up to date.
>>>>>
>>>>> Could be tricky to get this right if completion-fence-based
>>>>> submissions are mixed in as well, but that gives you at least a
>>>>> direction you could potentially go.
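>>>>>
>>>>> A rough sketch of that flow (all function and field names below
>>>>> are made up for illustration; this is not existing Xe code):
>>>>>
>>>>>   /* On a preemption request while a queue sits at a breakpoint:
>>>>>    * block further memory access from the EUs instead of removing
>>>>>    * the queue from the hardware, then signal the preemption fence
>>>>>    * so eviction can make progress.
>>>>>    */
>>>>>   static void queue_suspend_in_break(struct xe_exec_queue *q)
>>>>>   {
>>>>>           queue_block_memory_access(q);       /* hypothetical */
>>>>>           dma_fence_signal(q->preempt_fence); /* field name assumed */
>>>>>   }
>>>>>
>>>>>   /* Before the queue continues: revalidate the page tables,
>>>>>    * install a fresh preemption fence, then re-enable EU memory
>>>>>    * access.
>>>>>    */
>>>>>   static int queue_resume_from_break(struct xe_exec_queue *q)
>>>>>   {
>>>>>           int err = vm_revalidate(q->vm);     /* hypothetical */
>>>>>
>>>>>           if (err)
>>>>>                   return err;
>>>>>           q->preempt_fence = preempt_fence_create(q); /* hypothetical */
>>>>>           queue_unblock_memory_access(q);     /* hypothetical */
>>>>>           return 0;
>>>>>   }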
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>> Regards, Joonas
>>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>> Some wash-up thoughts from me below, but consider them fairly
>>>>> irrelevant, since I think the main driver for these big questions
>>>>> here should be gdb/userspace.
>>>>>
>>>>>
>>>>> Quoting Christian König (2024-11-07 11:44:33)
>>>>>
>>>>> On 06.11.24 at 18:00, Matthew Brost wrote:
>>>>>
>>>>> [SNIP]
>>>>>
>>>>> This is not a generic interface that anyone can freely access.
>>>>> The same permissions used by ptrace are checked when opening such
>>>>> an interface. See [1] [2].
>>>>>
>>>>>
>>>>> [1] https://patchwork.freedesktop.org/patch/617470/?series=136572&rev=2
>>>>>
>>>>> [2] https://patchwork.freedesktop.org/patch/617471/?series=136572&rev=2
>>>>>
>>>>>
>>>>> Thanks a lot for those pointers, that is exactly what I was
>>>>> looking for.
>>>>>
>>>>> And yeah, it is what I feared. You are re-implementing existing
>>>>> functionality, but see below.
>>>>>
>>>>> Could you elaborate on what this "existing functionality" exactly
>>>>> is? I do not think this functionality exists at this time.
>>>>>
>>>>> The EU debugging architecture for Xe specifically avoids the need
>>>>> for GDB to attach with ptrace to the CPU process, or to interfere
>>>>> with the CPU process for debugging via parasitic threads or the
>>>>> like.
>>>>>
>>>>> A debugger connection is opened to the DRM driver for a given PID
>>>>> (which uses the ptrace may-access check for now), after which all
>>>>> DRM clients of that PID are exposed to the debugger process.
>>>>>
>>>>> What we want to expose via that debugger connection is the
>>>>> ability for GDB to read/write the different GPU VM address spaces
>>>>> (ppGTT for Intel GPUs) just like the EU threads would see them.
>>>>> Note that the layout of the ppGTT is completely up to the
>>>>> userspace driver to set up, and mostly only partially matches the
>>>>> CPU address space.
>>>>>
>>>>> Specifically, as part of reading/writing the ppGTT for debugging
>>>>> purposes, deep flushes are needed: for example, flushing the
>>>>> instruction cache when adding/removing breakpoints.
>>>>>
>>>>> Maybe that will explain the background. I elaborate on this some
>>>>> more at the end.
>>>>>
>>>>>
>>>>> kmap/vmap are used everywhere in the DRM subsystem to access BOs,
>>>>> so I’m failing to see the problem with adding a simple helper
>>>>> based on existing code.
>>>>>
>>>>> What's possible and often done is to do kmap/vmap if you need to
>>>>> implement a CPU copy for scanout, for example, or for
>>>>> copying/validating command buffers. But that usually requires
>>>>> accessing the whole BO and has separate security checks.
>>>>>
>>>>> When you want to access only a few bytes of a BO, that sounds
>>>>> massively like a peek/poke-like interface, and we have already
>>>>> rejected that more than once. There even used to be standardized
>>>>> GEM IOCTLs for that, which have been removed by now.
>>>>>
>>>>> Referring to the explanation at the top: these IOCTLs are not for
>>>>> the debugging target process to issue. The peek/poke interface is
>>>>> specifically for GDB only, to facilitate the emulation of memory
>>>>> reads/writes on the GPU address space as if they were done by the
>>>>> EUs themselves. And to recap: for modifying instructions, for
>>>>> example (add/remove breakpoint), an extra level of cache flushing
>>>>> is needed which is not available to regular userspace.
>>>>>
>>>>> I specifically discussed the difference with Sima before moving
>>>>> forward with this design originally. If something has changed
>>>>> since then, I'm of course happy to rediscuss.
>>>>>
>>>>> However, if this code can't be added, I'm not sure how we would
>>>>> ever be able to implement core dumps for GPU threads/memory?
>>>>>
>>>>>
>>>>> If you need to access BOs which are placed in memory that is not
>>>>> CPU accessible, then implement the access callback for ptrace;
>>>>> see amdgpu_ttm_access_memory for an example of how to do this.
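>>>>>
>>>>> The wiring is roughly the following (a simplified sketch; a
>>>>> driver's actual ops table differs in details such as its fault
>>>>> handler):
>>>>>
>>>>>   /* ptrace's peek/poke path falls back to the VMA's .access
>>>>>    * callback when normal uaccess fails, so a TTM driver hooks BO
>>>>>    * access for ptrace via its vm_operations_struct.
>>>>>    */
>>>>>   static const struct vm_operations_struct drv_gem_vm_ops = {
>>>>>           .fault  = ttm_bo_vm_fault,
>>>>>           .open   = ttm_bo_vm_open,
>>>>>           .close  = ttm_bo_vm_close,
>>>>>           .access = ttm_bo_vm_access, /* ends up in the driver's
>>>>>                                          access_memory callback
>>>>>                                          for non-mappable
>>>>>                                          placements */
>>>>>   };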
>>>>>
>>>>> As also mentioned above, we don't work via ptrace at all when it
>>>>> comes to debugging the EUs. The only thing used for now is
>>>>> ptrace_may_access, to implement access restrictions similar to
>>>>> those of ptrace. This can be changed to something else if needed.
>>>>>
>>>>>
>>>>> Ptrace access via vm_operations_struct.access → ttm_bo_vm_access.
>>>>>
>>>>> This series renames ttm_bo_vm_access to ttm_bo_access, with no
>>>>> code changes.
>>>>>
>>>>> The above function accesses a BO via kmap if it is in SYSTEM /
>>>>> TT, which is existing code.
>>>>>
>>>>> This function is only exposed to user space via ptrace
>>>>> permissions.
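>>>>>
>>>>> The SYSTEM / TT path is essentially the following (a simplified
>>>>> sketch of the existing kmap helper, not a verbatim copy):
>>>>>
>>>>>   /* Map one page at a time with ttm_bo_kmap and memcpy through
>>>>>    * the kernel mapping; 'write' selects poke vs. peek. SYSTEM/TT
>>>>>    * pages are never iomem, so a plain memcpy is fine here.
>>>>>    */
>>>>>   static int bo_access_kmap(struct ttm_buffer_object *bo,
>>>>>                             unsigned long offset,
>>>>>                             void *buf, int len, int write)
>>>>>   {
>>>>>           unsigned long page = offset >> PAGE_SHIFT;
>>>>>           unsigned long bytes_left = len;
>>>>>           int ret;
>>>>>
>>>>>           do {
>>>>>                   unsigned long off = offset & ~PAGE_MASK;
>>>>>                   unsigned long bytes = min(bytes_left,
>>>>>                                             PAGE_SIZE - off);
>>>>>                   struct ttm_bo_kmap_obj map;
>>>>>                   bool is_iomem;
>>>>>                   void *ptr;
>>>>>
>>>>>                   ret = ttm_bo_kmap(bo, page, 1, &map);
>>>>>                   if (ret)
>>>>>                           return ret;
>>>>>
>>>>>                   ptr = ttm_kmap_obj_virtual(&map, &is_iomem) + off;
>>>>>                   if (write)
>>>>>                           memcpy(ptr, buf, bytes);
>>>>>                   else
>>>>>                           memcpy(buf, ptr, bytes);
>>>>>                   ttm_bo_kunmap(&map);
>>>>>
>>>>>                   page++;
>>>>>                   buf += bytes;
>>>>>                   offset += bytes;
>>>>>                   bytes_left -= bytes;
>>>>>           } while (bytes_left);
>>>>>
>>>>>           return len;
>>>>>   }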
>>>>>
>>>>> Maybe this sentence is what caused the confusion.
>>>>>
>>>>> Userspace is never exposed to a peek/poke interface, only the
>>>>> debugger connection, which is its own FD.
>>>>>
>>>>>
>>>>> In this series, we implement a function [3] similar to
>>>>> amdgpu_ttm_access_memory for the TTM vfunc access_memory. What is
>>>>> missing is access to non-CPU-visible memory, similar to
>>>>> amdgpu_ttm_access_memory_sdma. This will be addressed in a
>>>>> follow-up and was omitted in this series given its complexity.
>>>>>
>>>>> So, this looks more or less identical to AMD's ptrace
>>>>> implementation, but in GPU address space. Again, I fail to see
>>>>> what the problem is here. What am I missing?
>>>>>
>>>>>
>>>>> The main question is: why can't you use the existing interfaces
>>>>> directly?
>>>>>
>>>>> We're not working on the CPU address space or BOs. We're working
>>>>> strictly on the GPU address space as it would be seen by an EU
>>>>> thread if it accessed address X.
>>>>>
>>>>>
>>>>> In addition to the peek/poke interface of ptrace, Linux has the
>>>>> pidfd_getfd system call, see here:
>>>>> https://man7.org/linux/man-pages/man2/pidfd_getfd.2.html
>>>>>
>>>>> pidfd_getfd() allows you to dup() the render node file descriptor
>>>>> into your gdb process. That in turn gives you all the access you
>>>>> need from gdb, including mapping BOs and command submission on
>>>>> behalf of the application.
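>>>>>
>>>>> From the debugger side that looks roughly like this (a minimal
>>>>> userspace sketch; the target pid and fd number are assumed to be
>>>>> known, e.g. from /proc/<pid>/fd):
>>>>>
>>>>>   #include <sys/syscall.h>
>>>>>   #include <sys/types.h>
>>>>>   #include <unistd.h>
>>>>>
>>>>>   /* Duplicate the target's render node fd into our process.
>>>>>    * pidfd_getfd() enforces the same PTRACE_MODE_ATTACH_REALCREDS
>>>>>    * check that ptrace attach uses.
>>>>>    */
>>>>>   int dup_render_fd(pid_t target, int target_fd)
>>>>>   {
>>>>>           int pidfd = syscall(SYS_pidfd_open, target, 0);
>>>>>           int fd;
>>>>>
>>>>>           if (pidfd < 0)
>>>>>                   return -1;
>>>>>           fd = syscall(SYS_pidfd_getfd, pidfd, target_fd, 0);
>>>>>           close(pidfd);
>>>>>           return fd;
>>>>>   }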
>>>>>
>>>>> We're not operating on the CPU address space, nor are we
>>>>> operating on BOs (there is no concept of a BO in the EU debug
>>>>> interface). Each VMA in the VM could come from anywhere; only the
>>>>> start address and size matter. And neither do we need to
>>>>> interfere with the command submission of the process under debug.
>>>>>
>>>>>
>>>>> As far as I can see, that allows for the same functionality as
>>>>> the eudebug interface, just without any driver-specific code
>>>>> messing with ptrace permissions and peek/poke interfaces.
>>>>>
>>>>> So the question is still: why do you need the whole eudebug
>>>>> interface in the first place? I might be missing something, but
>>>>> that seems to be superfluous from a high-level view.
>>>>>
>>>>> Recapping from above: it is to allow the debugging of EU threads
>>>>> per DRM client, completely independent of the CPU process. If
>>>>> ptrace_may_access is the sore point, we could consider other
>>>>> permission checks, too. There is no other connection to ptrace in
>>>>> this architecture than the single permission check to know if a
>>>>> PID is fair game for access by the debugger process.
>>>>>
>>>>> Why no parasitic thread or ptrace: going forward, binding the EU
>>>>> debugging to the DRM client would also pave the way for being
>>>>> able to extend the core-kernel-generated core dump with each DRM
>>>>> client's EU thread/memory dump. We have a similar feature called
>>>>> "Offline core dump" enabled in the downstream public trees for
>>>>> i915, where we currently attach the EU thread dump to the i915
>>>>> error state and then later combine the i915 error state with the
>>>>> CPU core dump file using a tool.
>>>>>
>>>>> This is a relatively small amount of extra code, as this baseline
>>>>> series already gives GDB the ability to perform the necessary
>>>>> actions. It's just a matter of the kernel driver calling "stop
>>>>> all threads", then copying the memory map and memory contents for
>>>>> GPU threads, just as is done for CPU threads.
>>>>>
>>>>> With parasitic thread injection, I'm not sure there is such a way
>>>>> forward, as it would seem to require injecting quite a bit more
>>>>> logic into the core kernel?
>>>>>
>>>>>
>>>>> It's true that the AMD KFD part still has similar functionality,
>>>>> but that is because of the broken KFD design of tying driver
>>>>> state to the CPU process (which makes it inaccessible to gdb even
>>>>> with an imported render node fd).
>>>>>
>>>>> Both Sima and I (and partially Dave as well) have pushed back on
>>>>> the KFD approach. And the long-term plan is to get rid of such
>>>>> device-driver-specific interfaces which re-implement existing
>>>>> functionality, just differently.
>>>>>
>>>>> Recapping, this series is not adding it back. The debugger
>>>>> connection is a separate FD from the DRM one, with a separate
>>>>> IOCTL set. We don't allow the DRM FD any new operations based on
>>>>> whether ptrace is attached or not. We never even do that check.
>>>>>
>>>>> We only restrict the opening of the debugger connection to a
>>>>> given PID with a ptrace_may_access check for now. That can be
>>>>> changed to something else, if necessary.
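>>>>>
>>>>> I.e. the open path is gated roughly like this (a sketch; the
>>>>> function name is illustrative, not the actual uapi):
>>>>>
>>>>>   /* Only allow a debugger connection to a PID the caller could
>>>>>    * also ptrace; everything else lives on the separate debug FD.
>>>>>    */
>>>>>   static int eudebug_connect_check(pid_t pid)
>>>>>   {
>>>>>           struct task_struct *task;
>>>>>           bool allowed;
>>>>>
>>>>>           rcu_read_lock();
>>>>>           task = pid_task(find_vpid(pid), PIDTYPE_PID);
>>>>>           if (task)
>>>>>                   get_task_struct(task);
>>>>>           rcu_read_unlock();
>>>>>           if (!task)
>>>>>                   return -ESRCH;
>>>>>
>>>>>           allowed = ptrace_may_access(task,
>>>>>                                       PTRACE_MODE_READ_REALCREDS);
>>>>>           put_task_struct(task);
>>>>>
>>>>>           return allowed ? 0 : -EACCES;
>>>>>   }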
>>>>>
>>>>> Yeah, I think unnecessarily tying gpu processes to cpu processes
>>>>> is a bad thing, not least because even today all the svm
>>>>> discussions we have still hit clear use-cases where a 1:1 match
>>>>> is not wanted (like multiple gpu svm sections with offsets). Not
>>>>> even speaking of all the gpu use-cases where the gpu vm space is
>>>>> still entirely independent of the cpu side.
>>>>>
>>>>> So that's why I think this entirely separate approach looks like
>>>>> the right one, with ptrace_may_access as the access control check
>>>>> to make sure we match ptrace on the cpu side.
>>>>>
>>>>> But there's very obviously a bikeshed to be had on what the
>>>>> actual uapi should look like, especially how gdb opens up a gpu
>>>>> debug access fd. But I also think that's not much for drm to
>>>>> decide, but rather whatever gdb wants. And then we aim for some
>>>>> consistency on that lookup/access control part (ideally; I might
>>>>> be missing some reasons why this is a bad idea) across drm
>>>>> drivers.
>>>>>
>>>>>
>>>>> So you need to have a really, really good explanation of why the
>>>>> eudebug interface is actually necessary.
>>>>>
>>>>> TL;DR: The main point is to decouple the debugging of the EU
>>>>> workloads from the debugging of the CPU process. This avoids
>>>>> interfering with the CPU process via parasitic thread injection.
>>>>> Further, this also allows generating a core dump without any GDB
>>>>> connected. There are also many other smaller pros/cons which can
>>>>> be discussed, but for the context of this patch, this is the main
>>>>> one.
>>>>>
>>>>> So unlike parasitic thread injection, we don't unlock any special
>>>>> IOCTLs for the process under debug to be performed by the
>>>>> parasitic thread; instead we allow a minimal set of operations to
>>>>> be performed by GDB as if they were done on the EUs themselves.
>>>>>
>>>>> One can think of it as a minimal subset of ptrace, but for EU
>>>>> threads, not CPU threads. And thus, building on this, it's
>>>>> possible to extend the kernel-generated core dumps with a
>>>>> DRM-specific extension which would contain the EU thread/memory
>>>>> dump.
>>>>>
>>>>> It might be good to document (in that debugging doc patch,
>>>>> probably) why thread injection is not a great option, and why the
>>>>> tradeoffs for debugging are different than for
>>>>> checkpoint/restore, where with CRIU we landed on doing most of
>>>>> this in userspace, often requiring injection threads to make it
>>>>> all work.
>>>>>
>>>>> Cheers, Sima
>>>>>
>>>>>
>>>>> Regards, Joonas
>>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>>
>>>>> Matt
>>>>>
>>>>>
>>>>> [3] https://patchwork.freedesktop.org/patch/622520/?series=140200&rev=6
>>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>> Matt
>>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>>