[PATCH v6 2/8] drm/ttm: Add ttm_bo_access

Mon Nov 11 11:34:12 UTC 2024

Am 11.11.24 um 11:10 schrieb Simona Vetter:
> On Mon, Nov 11, 2024 at 10:00:17AM +0200, Joonas Lahtinen wrote:
>> Back from some time off and will try to answer below.
>>
>> Adding Dave and Sima as this topic has been previously discussed to some
>> extent and will be good to reach common understanding about what the
>> series is trying to do and what is the difference to the AMD debugging
>> model.
> I chatted about this thread a bit on irc with folks, and I think an
> orthogonal issue is the question, what should be in ttm-utils? I've asked
> Matt to type up a DOC patch once we have some consensus, since imo the
> somewhat lackluster documentation situation for ttm is also somewhat a
> cause for these big threads on various different topics. Aside from the
> fact that gpu memory management is just hard.
>
> On the uapi/design aspect, I think this would serve well with a patch to
> drm-uapi.rst that adds a debugging section? At least once we have some
> rough consensus across drivers, and more importantly userspace in the form
> of gdb upstream (at least I'm not aware of any other upstream debugger
> patches, I think amd's rocm stuff is also gdb-only).

Yeah that seems to be a really good idea. Similar design ideas came up 
AMD internally as well but where dropped after pointing people to 
pidfd_getfd().

But the bigger problem seems to be that the design doesn't seems to take 
the dma_fence requirements into account.

In other words attaching gdb to a pid seems to stop the GPU thread of 
this pid without waiting for the XE preemption nor end of operation fence.

I mean if the GPU threads are preempted that could work, but yeah not 
like this :)

Regards,
Christian.

>
> Some wash-up thoughts from me below, but consider them fairly irrelevant
> since I think the main driver for these big questions here should be
> gdb/userspace.
>
>> Quoting Christian König (2024-11-07 11:44:33)
>>> Am 06.11.24 um 18:00 schrieb Matthew Brost:
>>>
>>>      [SNIP]
>>>
>>>      This is not a generic interface that anyone can freely access. The same
>>>      permissions used by ptrace are checked when opening such an interface.
>>>      See [1] [2].
>>>
>>>      [1] https://patchwork.freedesktop.org/patch/617470/?series=136572&rev=2
>>>      [2] https://patchwork.freedesktop.org/patch/617471/?series=136572&rev=2
>>>
>>>
>>> Thanks a lot for those pointers, that is exactly what I was looking for.
>>>
>>> And yeah, it is what I feared. You are re-implementing existing functionality,
>>> but see below.
>> Could you elaborate on what this "existing functionality" exactly is?
>> I do not think this functionality exists at this time.
>>
>> The EU debugging architecture for Xe specifically avoids the need for GDB
>> to attach with ptrace to the CPU process or interfere with the CPU process for
>> the debugging via parasitic threads or so.
>>
>> Debugger connection is opened to the DRM driver for given PID (which uses the
>> ptrace may access check for now) after which the all DRM client of that
>> PID are exposed to the debugger process.
>>
>> What we want to expose via that debugger connection is the ability for GDB to
>> read/write the different GPU VM address spaces (ppGTT for Intel GPUs) just like
>> the EU threads would see them. Note that the layout of the ppGTT is
>> completely up to the userspace driver to setup and is mostly only partially
>> equal to the CPU address space.
>>
>> Specifically as part of reading/writing the ppGTT for debugging purposes,
>> there are deep flushes needed: for example flushing instruction cache
>> when adding/removing breakpoints.
>>
>> Maybe that will explain the background. I elaborate on this at the end some more.
>>
>>>              kmap/vmap are used everywhere in the DRM subsystem to access BOs, so I’m
>>>              failing to see the problem with adding a simple helper based on existing
>>>              code.
>>>
>>>          What#s possible and often done is to do kmap/vmap if you need to implement a
>>>          CPU copy for scanout for example or for copying/validating command buffers.
>>>          But that usually requires accessing the whole BO and has separate security
>>>          checks.
>>>
>>>          When you want to access only a few bytes of a BO that sounds massively like
>>>          a peek/poke like interface and we have already rejected that more than once.
>>>          There even used to be standardized GEM IOCTLs for that which have been
>>>          removed by now.
>> Referring to the explanation at top: These IOCTL are not for the debugging target
>> process to issue. The peek/poke interface is specifically for GDB only
>> to facilitate the emulation of memory reads/writes on the GPU address
>> space as they were done by EUs themselves. And to recap: for modifying
>> instructions for example (add/remove breakpoint), extra level of cache flushing is
>> needed which is not available to regular userspace.
>>
>> I specifically discussed with Sima on the difference before moving forward with this
>> design originally. If something has changed since then, I'm of course happy to rediscuss.
>>
>> However, if this code can't be added, not sure how we would ever be able
>> to implement core dumps for GPU threads/memory?
>>
>>>          If you need to access BOs which are placed in not CPU accessible memory then
>>>          implement the access callback for ptrace, see amdgpu_ttm_access_memory for
>>>          an example how to do this.
>> As also mentioned above, we don't work via ptrace at all when it comes
>> to debugging the EUs. The only thing used for now is the ptrace_may_access to
>> implement similar access restrictions as ptrace has. This can be changed
>> to something else if needed.
>>
>>>      Ptrace access via vm_operations_struct.access → ttm_bo_vm_access.
>>>
>>>      This series renames ttm_bo_vm_access to ttm_bo_access, with no code changes.
>>>
>>>      The above function accesses a BO via kmap if it is in SYSTEM / TT,
>>>      which is existing code.
>>>
>>>      This function is only exposed to user space via ptrace permissions.
>> Maybe this sentence is what caused the confusion.
>>
>> Userspace is never exposed with peek/poke interface, only the debugger
>> connection which is its own FD.
>>
>>>      In this series, we implement a function [3] similar to
>>>      amdgpu_ttm_access_memory for the TTM vfunc access_memory. What is
>>>      missing is non-visible CPU memory access, similar to
>>>      amdgpu_ttm_access_memory_sdma. This will be addressed in a follow-up and
>>>      was omitted in this series given its complexity.
>>>
>>>      So, this looks more or less identical to AMD's ptrace implementation,
>>>      but in GPU address space. Again, I fail to see what the problem is here.
>>>      What am I missing?
>>>
>>>
>>> The main question is why can't you use the existing interfaces directly?
>> We're not working on the CPU address space or BOs. We're working
>> strictly on the GPU address space as would be seen by an EU thread if it
>> accessed address X.
>>
>>> Additional to the peek/poke interface of ptrace Linux has the pidfd_getfd
>>> system call, see here https://man7.org/linux/man-pages/man2/pidfd_getfd.2.html.
>>>
>>> The pidfd_getfd() allows to dup() the render node file descriptor into your gdb
>>> process. That in turn gives you all the access you need from gdb, including
>>> mapping BOs and command submission on behalf of the application.
>> We're not operating on the CPU address space nor are we operating on BOs
>> (there is no concept of BO in the EU debug interface). Each VMA in the VM
>> could come from anywhere, only the start address and size matter. And
>> neither do we need to interfere with the command submission of the
>> process under debug.
>>
>>> As far as I can see that allows for the same functionality as the eudebug
>>> interface, just without any driver specific code messing with ptrace
>>> permissions and peek/poke interfaces.
>>>
>>> So the question is still why do you need the whole eudebug interface in the
>>> first place? I might be missing something, but that seems to be superfluous
>>> from a high level view.
>> Recapping from above. It is to allow the debugging of EU threads per DRM
>> client, completely independent of the CPU process. If ptrace_may_acces
>> is the sore point, we could consider other permission checks, too. There
>> is no other connection to ptrace in this architecture as single
>> permission check to know if PID is fair game to access by debugger
>> process.
>>
>> Why no parasitic thread or ptrace: Going forward, binding the EU debugging to
>> the DRM client would also pave way for being able to extend core kernel generated
>> core dump with each DRM client's EU thread/memory dump. We have similar
>> feature called "Offline core dump" enabled in the downstream public
>> trees for i915, where we currently attach the EU thread dump to i915 error state
>> and then later combine i915 error state with CPU core dump file with a
>> tool.
>>
>> This is relatively little amount of extra code, as this baseline series
>> already introduces GDB the ability to perform the necessary actions.
>> It's just the matter of kernel driver calling: "stop all threads", then
>> copying the memory map and memory contents for GPU threads, just like is
>> done for CPU threads.
>>
>> With parasitic thread injection, not sure if there is such way forward,
>> as it would seem to require to inject quite abit more logic to core kernel?
>>
>>> It's true that the AMD KFD part has still similar functionality, but that is
>>> because of the broken KFD design of tying driver state to the CPU process
>>> (which makes it inaccessible for gdb even with imported render node fd).
>>>
>>> Both Sima and I (and partially Dave as well) have pushed back on the KFD
>>> approach. And the long term plan is to get rid of such device driver specific
>>> interface which re-implement existing functionality just differently.
>> Recapping, this series is not adding it back. The debugger connection
>> is a separate FD from the DRM one, with separate IOCTL set. We don't allow
>> the DRM FD any new operations based on ptrace is attached or not. We
>> don't ever do that check even.
>>
>> We only restrict the opening of the debugger connection to given PID with
>> ptrace_may_access check for now. That can be changed to something else,
>> if necessary.
> Yeah I think unnecessarily tying gpu processes to cpu processes is a bad
> thing, least because even today all the svm discussions we have still hit
> clear use-cases, where a 1:1 match is not wanted (like multiple gpu svm
> sections with offsets). Not even speaking of all the gpu usecases where
> the gpu vm space is still entirely independent of the cpu side.
>
> So that's why I think this entirely separate approach looks like the right
> one, with ptrace_may_access as the access control check to make sure we
> match ptrace on the cpu side.
>
> But there's very obviously a bikeshed to be had on what the actual uapi
> should look like, especially how gdb opens up a gpu debug access fd. But I
> also think that's not much on drm to decide, but whatever gdb wants. And
> then we aim for some consistency on that lookup/access control part
> (ideally, I might be missing some reasons why this is a bad idea) across
> drm drivers.
>
>>> So you need to have a really really good explanation why the eudebug interface
>>> is actually necessary.
>> TL;DR The main point is to decouple the debugging of the EU workloads from the
>> debugging of the CPU process. This avoids the interference with the CPU process with
>> parasitic thread injection. Further this also allows generating a core dump
>> without any GDB connected. There are also many other smaller pros/cons
>> which can be discussed but for the context of this patch, this is the
>> main one.
>>
>> So unlike parasitic thread injection, we don't unlock any special IOCTL for
>> the process under debug to be performed by the parasitic thread, but we
>> allow the minimal set of operations to be performed by GDB as if those were
>> done on the EUs themselves.
>>
>> One can think of it like the minimal subset of ptrace but for EU threads,
>> not the CPU threads. And thus, building on this it's possible to extend
>> the core kernel generated core dumps with DRM specific extension which
>> would contain the EU thread/memory dump.
> It might be good to document (in that debugging doc patch probably) why
> thread injection is not a great option, and why the tradeoffs for
> debugging are different than for for checkpoint/restore, where with CRIU
> we landed on doing most of this in userspace, and often requiring
> injection threads to make it all work.
>
> Cheers, Sima
>
>> Regards, Joonas
>>
>>> Regards,
>>> Christian.
>>>
>>>
>>>
>>>      Matt
>>>
>>>      [3] https://patchwork.freedesktop.org/patch/622520/?series=140200&rev=6
>>>
>>>
>>>          Regards,
>>>          Christian.
>>>
>>>
>>>              Matt
>>>
>>>
>>>                  Regards,
>>>                  Christian.
>>>