[PATCH v6 2/8] drm/ttm: Add ttm_bo_access
Christian König
christian.koenig at amd.com
Tue Nov 12 16:25:24 UTC 2024
On 12.11.24 17:22, Thomas Hellström wrote:
> On Tue, 2024-11-12 at 15:41 +0200, Joonas Lahtinen wrote:
>> (+ Thomas)
>>
>> Quoting Christian König (2024-11-12 11:23:36)
>>> On 11.11.24 23:45, Matthew Brost wrote:
>>>
>>> [SNIP]
>>>
>>> So I think the only way to allow interactive debugging is to avoid
>>> the dma_fences. Curious to hear if there are ideas otherwise.
>>>
>>> You need to guarantee somehow that the process is taken off the
>>> hardware so that the preemption fence can signal.
>>>
>>>
>>> Our preemption fences have this functionality.
>>>
>>> A preemption fence issues a suspend execution command to the
>>> firmware. The
>>> firmware, in turn, attempts to preempt the workload. If it
>>> doesn't respond
>>> within a specified period, it resets the hardware queue, sends
>>> a message to KMD,
>>> bans the software queue, and signals the preemption fence.
>>>
>>> We provide even more protection than that. If, for some reason,
>>> the firmware
>>> doesn't respond within a longer timeout period, the KMD
>>> performs a device reset,
>>> bans the offending software queue(s), and signals the
>>> preemption fences.
>>>
>>> This flow remains the same whether a debugger is attached or,
>>> for example, a
>>> user submits a 10-minute non-preemptable workload. In either
>>> case, other
>>> processes are guaranteed to make forward progress.
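>>>
>>> Roughly, in pseudo-driver form (fw_suspend_queue(), hw_reset_queue_and_ban()
>>> and struct hw_queue below are placeholder names, not actual Xe code):
>>>
>>>   #include <linux/dma-fence.h>
>>>   #include <linux/spinlock.h>
>>>   #include <linux/workqueue.h>
>>>
>>>   struct hw_queue;                                     /* placeholder */
>>>   int fw_suspend_queue(struct hw_queue *q, unsigned int timeout_ms);
>>>   void hw_reset_queue_and_ban(struct hw_queue *q);
>>>
>>>   struct preempt_fence {
>>>           struct dma_fence base;
>>>           spinlock_t lock;          /* passed to dma_fence_init() */
>>>           struct work_struct work;
>>>           struct hw_queue *q;
>>>   };
>>>
>>>   static void preempt_fence_work(struct work_struct *w)
>>>   {
>>>           struct preempt_fence *pf = container_of(w, struct preempt_fence, work);
>>>
>>>           /* Ask the firmware to suspend the queue; on timeout (non-zero
>>>            * return) reset the hardware queue and ban the software queue. */
>>>           if (fw_suspend_queue(pf->q, 1000))
>>>                   hw_reset_queue_and_ban(pf->q);
>>>
>>>           /* Either way the fence signals, so waiters always make
>>>            * forward progress. */
>>>           dma_fence_signal(&pf->base);
>>>   }
>>>
>>>   static bool preempt_fence_enable_signaling(struct dma_fence *fence)
>>>   {
>>>           struct preempt_fence *pf = container_of(fence, struct preempt_fence, base);
>>>
>>>           queue_work(system_unbound_wq, &pf->work);
>>>           return true;
>>>   }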
>>>
>>>
>>> Yeah, that is pretty much the same argumentation I have heard before,
>>> and it turned out not to work.
>>>
>>>
>>> The example above illustrates the memory oversubscription case,
>>> where two processes are each using 51% of the memory.
>>>
>>>
>>> That isn't even necessary. We have seen applications dying just
>>> because the core memory management tried to join back small pages
>>> into huge pages in a userptr.
>>>
>>> The core memory management jumping in and requesting that the
>>> pre-emption fence signal can happen at any time.
>> Ouch. Does there happen to be a known reproducer for this behavior,
>> or maybe a bug report?
>>
>>> You can mitigate that a bit; Fedora, for example, disables joining
>>> back small pages into huge pages by default, and we even had people
>>> suggesting to use mprotect() so that userptr VMAs don't fork() any
>>> more (which is of course completely illegal).
>>>
>>> But my long-term takeaway is that you can't block all causes of
>>> sudden requests to let a pre-emption fence signal.
>> I think this problem equally applies to the LR-workloads like the EU
>> debugging ones.
>>
>>> Another preemption scenario involves two processes sharing
>>> hardware resources.
>>> Our firmware follows the same flow here. If an LR workload is
>>> using a hardware
>>> resource and a DMA-fence workload is waiting, and if the LR
>>> workload doesn't
>>> preempt in a timely manner, the firmware issues a hardware
>>> reset, notifies
>>> KMD, and bans the LR software queue. The DMA-fence workload
>>> can then make
>>> forward progress.
>>>
>>> With the above in mind, this is why I say that if a user tries
>>> to run a game and
>>> a non-preemptable LR workload, either oversubscribing memory or
>>> sharing hardware
>>> resources, it is unlikely to work well. However, I don't think
>>> this is a common
>>> use case. I would expect that when a debugger is open, it is
>>> typically by a
>>> power user who knows how to disable other GPU tasks (e.g., by
>>> enabling software
>>> rendering or using a machine without any display).
>>>
>>> Given this, please reconsider your position.
>>>
>>>
>>> The key point here is that this isn't stable; you can do that as a
>>> tech demo, but it can always happen that an application being
>>> debugged just randomly dies. And believe me, AMD has tried this to a
>>> rather extreme extent as well.
>> It's not really limited to the debuggable applications at all; the
>> normal LR workloads are equally impacted as far as I understand. It is
>> just harder to catch the issue with LR workloads if the pre-emption
>> fence signaling is sporadic.
>>
>>> What could potentially work is to taint the kernel and make sure that
>>> this functionality is only available to users who absolutely know
>>> what they are doing.
>>>
>>> But I would say we can only allow that if all other options have
>>> been exhausted and doing it like this is really the only option left.
>> It sounds like servicing the memory pre-empt fence by stealing the
>> pages from underneath the workload would be the way to resolve this
>> issue.
>>
>> This has been extensively discussed already, but was expected to
>> really
>> only be needed for low-on-memory scenarios. However, it now seems the
>> need arises much earlier due to the random userptr page joining by the
>> core mm.
> Just to clarify here:
>
> In Long-Running mode with recoverable pagefaults enabled we don't have
> any preempt-fences, but rather just zap the PTEs pointing to the
> affected memory and flush the TLB. So from a memory resource POV a
> breakpoint should be safe, and no mmu notifier nor shrinker will be
> blocked.
That sounds like an HMM-based approach, which would clearly work.
But where is that? I don't see any HMM-based approach anywhere in the Xe
driver.
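
What Thomas describes sounds like the usual mmu_interval_notifier
pattern, i.e. something along these lines (placeholder helpers, not
actual Xe code):

  #include <linux/mmu_notifier.h>

  struct my_gpu_vm;                                      /* placeholder */
  void my_vm_zap_range(struct my_gpu_vm *vm, unsigned long start,
                       unsigned long end);
  void my_vm_flush_tlb(struct my_gpu_vm *vm);

  struct my_userptr {
          struct mmu_interval_notifier notifier;
          struct my_gpu_vm *vm;
  };

  static bool my_userptr_invalidate(struct mmu_interval_notifier *mni,
                                    const struct mmu_notifier_range *range,
                                    unsigned long cur_seq)
  {
          struct my_userptr *uptr = container_of(mni, struct my_userptr,
                                                 notifier);

          mmu_interval_set_seq(mni, cur_seq);

          /* Zap the GPU PTEs covering the invalidated range and flush the
           * TLB; the next recoverable GPU page fault repopulates them, so
           * no dma_fence ever needs to signal here (the seq/retry locking
           * of the fault path is omitted for brevity). */
          my_vm_zap_range(uptr->vm, range->start, range->end);
          my_vm_flush_tlb(uptr->vm);

          return true;
  }

  static const struct mmu_interval_notifier_ops my_userptr_notifier_ops = {
          .invalidate = my_userptr_invalidate,
  };
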
Regards,
Christian.
>
> Nor will there be any jobs with published dma-fences depending on the
> job blocked either temporarily by a pagefault or long-term by a
> debugger breakpoint.
>
> /Thomas
>
>
>> If that is done and the memory pre-empt fence is serviced even for
>> debuggable contexts, do you have further concerns with the presented
>> approach
>> from a dma-buf and drm/sched perspective?
>>
>> Regards, Joonas
>>
>>> Regards,
>>> Christian.
>>>
>>>
>>> This means that a breakpoint or core dump doesn't halt GPU
>>> threads, but
>>> rather suspends them. E.g. all running wave data is
>>> collected into a state
>>> bag which can be restored later on.
>>>
>>> I was under the impression that those long running compute
>>> threads do
>>> exactly that, but if the hardware can't switch out the
>>> GPU thread/process
>>> while at a breakpoint, then that isn't the case.
>>>
>>> As long as you don't find a way to avoid that, this patch
>>> set is a pretty
>>> clear NAK from my side as DMA-buf and TTM maintainer.
>>>
>>>
>>> I believe this is addressed above.
>>>
>>> Matt
>>>
>>>
>>> What might work is to keep the submission on the hardware
>>> in the break state
>>> but forbid any memory access. This way you can signal your
>>> preemption fence
>>> even when the hardware isn't made available.
>>>
>>> Before you continue, Xe sets up a new pre-emption fence and
>>> makes sure that
>>> all page tables etc. are up to date.
>>>
>>> Could be tricky to get this right if completion fence based
>>> submissions are
>>> mixed in as well, but that gives you at least a direction
>>> you could
>>> potentially go.
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>> Regards, Joonas
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>> Some wash-up thoughts from me below, but
>>> consider them fairly irrelevant
>>> since I think the main driver for these big
>>> questions here should be
>>> gdb/userspace.
>>>
>>>
>>> Quoting Christian König (2024-11-07
>>> 11:44:33)
>>>
>>> On 06.11.24 18:00, Matthew Brost wrote:
>>>
>>> [SNIP]
>>>
>>> This is not a generic interface
>>> that anyone can freely access. The same
>>> permissions used by ptrace are
>>> checked when opening such an interface.
>>> See [1] [2].
>>>
>>>
>>> [1] https://patchwork.freedesktop.org/patch/617470/?series=136572&rev=2
>>>
>>> [2] https://patchwork.freedesktop.org/patch/617471/?series=136572&rev=2
>>>
>>>
>>> Thanks a lot for those pointers, that
>>> is exactly what I was looking for.
>>>
>>> And yeah, it is what I feared. You are
>>> re-implementing existing functionality,
>>> but see below.
>>>
>>> Could you elaborate on what this "existing
>>> functionality" exactly is?
>>> I do not think this functionality exists at
>>> this time.
>>>
>>> The EU debugging architecture for Xe
>>> specifically avoids the need for GDB
>>> to attach with ptrace to the CPU process or
>>> to interfere with the CPU process for
>>> debugging via parasitic threads or the like.
>>>
>>> A debugger connection is opened to the DRM
>>> driver for a given PID (which uses the
>>> ptrace may-access check for now), after
>>> which all DRM clients of that
>>> PID are exposed to the debugger process.
>>>
>>> What we want to expose via that debugger
>>> connection is the ability for GDB to
>>> read/write the different GPU VM address
>>> spaces (ppGTT for Intel GPUs) just like
>>> the EU threads would see them. Note that
>>> the layout of the ppGTT is
>>> completely up to the userspace driver to
>>> set up and usually only partially
>>> matches the CPU address space.
>>>
>>> Specifically as part of reading/writing the
>>> ppGTT for debugging purposes,
>>> there are deep flushes needed: for example
>>> flushing the instruction cache
>>> when adding/removing breakpoints.
>>>
>>> Maybe that will explain the background. I
>>> elaborate on this at the end some more.
>>>
>>>
>>> kmap/vmap are used
>>> everywhere in the DRM subsystem to access BOs, so I’m
>>> failing to see the
>>> problem with adding a simple helper based on existing
>>> code.
>>>
>>> What's possible and often
>>> done is to do kmap/vmap if you need to implement a
>>> CPU copy for scanout for
>>> example or for copying/validating command buffers.
>>> But that usually requires
>>> accessing the whole BO and has separate security
>>> checks.
>>>
>>> When you want to access only
>>> a few bytes of a BO, that sounds very much like
>>> a peek/poke-like interface,
>>> and we have already rejected that more than once.
>>> There even used to be
>>> standardized GEM IOCTLs for that which have been
>>> removed by now.
>>>
>>> Referring to the explanation at the top: these
>>> IOCTLs are not for the debugging target
>>> process to issue. The peek/poke interface
>>> is specifically for GDB only,
>>> to facilitate the emulation of memory
>>> reads/writes on the GPU address
>>> space as if they were done by the EUs themselves.
>>> And to recap: for modifying
>>> instructions, for example (adding/removing a
>>> breakpoint), an extra level of cache flushing is
>>> needed which is not available to regular
>>> userspace.
>>>
>>> I specifically discussed the difference with
>>> Sima before originally moving forward with this
>>> design. If something has changed
>>> since then, I'm of course happy to rediscuss.
>>>
>>> However, if this code can't be added, I'm not
>>> sure how we would ever be able
>>> to implement core dumps for GPU
>>> threads/memory.
>>>
>>>
>>> If you need to access BOs
>>> which are placed in memory that is not CPU accessible, then
>>> implement the access callback
>>> for ptrace; see amdgpu_ttm_access_memory for
>>> an example of how to do this.
>>>
>>> As also mentioned above, we don't work via
>>> ptrace at all when it comes
>>> to debugging the EUs. The only thing used
>>> for now is ptrace_may_access to
>>> implement similar access restrictions as
>>> ptrace has. This can be changed
>>> to something else if needed.
>>>
>>>
>>> Ptrace access via
>>> vm_operations_struct.access → ttm_bo_vm_access.
>>>
>>> This series renames
>>> ttm_bo_vm_access to ttm_bo_access, with no code changes.
>>>
>>> The above function accesses a BO
>>> via kmap if it is in SYSTEM / TT,
>>> which is existing code.
>>>
>>> This function is only exposed to
>>> user space via ptrace permissions.
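>>>
>>> For reference, on the driver side that path is just the standard TTM
>>> helpers wired into the mmap ops, roughly (the struct name below is a
>>> placeholder):
>>>
>>>   static const struct vm_operations_struct my_gem_vm_ops = {
>>>           .fault  = ttm_bo_vm_fault,
>>>           .open   = ttm_bo_vm_open,
>>>           .close  = ttm_bo_vm_close,
>>>           /* reached from ptrace()/access_process_vm() on a GPU mmap */
>>>           .access = ttm_bo_vm_access,
>>>   };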
>>>
>>> Maybe this sentence is what caused the
>>> confusion.
>>>
>>> Userspace is never exposed to a peek/poke
>>> interface, only the debugger
>>> connection, which is its own FD.
>>>
>>>
>>> In this series, we implement a
>>> function [3] similar to
>>> amdgpu_ttm_access_memory for the
>>> TTM vfunc access_memory. What is
>>> missing is access to non-CPU-visible memory,
>>> similar to
>>> amdgpu_ttm_access_memory_sdma.
>>> This will be addressed in a follow-up and
>>> was omitted in this series given
>>> its complexity.
>>>
>>> So, this looks more or less
>>> identical to AMD's ptrace implementation,
>>> but in GPU address space. Again,
>>> I fail to see what the problem is here.
>>> What am I missing?
>>>
>>>
>>> The main question is why can't you use
>>> the existing interfaces directly?
>>>
>>> We're not working on the CPU address space
>>> or BOs. We're working
>>> strictly on the GPU address space as it would
>>> be seen by an EU thread if it
>>> accessed address X.
>>>
>>>
>>> In addition to the peek/poke interface
>>> of ptrace, Linux has the pidfd_getfd
>>> system call, see
>>> here: https://man7.org/linux/man-pages/man2/pidfd_getfd.2.html.
>>>
>>> pidfd_getfd() allows you to dup() the
>>> render node file descriptor into your gdb
>>> process. That in turn gives you all the
>>> access you need from gdb, including
>>> mapping BOs and command submission on
>>> behalf of the application.
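>>>
>>> From the gdb side that is roughly the following (raw syscalls, since
>>> older glibc has no wrappers; finding the target's fd number, e.g. via
>>> /proc/<pid>/fd, and error handling are left out):
>>>
>>>   #define _GNU_SOURCE
>>>   #include <sys/syscall.h>
>>>   #include <sys/types.h>
>>>   #include <unistd.h>
>>>
>>>   /* Duplicate the render node fd of the debugged process into gdb;
>>>    * requires PTRACE_MODE_ATTACH_REALCREDS permission over the target. */
>>>   static int grab_render_node(pid_t target, int target_drm_fd)
>>>   {
>>>           int pidfd = syscall(SYS_pidfd_open, target, 0);
>>>
>>>           if (pidfd < 0)
>>>                   return -1;
>>>
>>>           return syscall(SYS_pidfd_getfd, pidfd, target_drm_fd, 0);
>>>   }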
>>>
>>> We're not operating on the CPU address
>>> space nor are we operating on BOs
>>> (there is no concept of BO in the EU debug
>>> interface). Each VMA in the VM
>>> could come from anywhere, only the start
>>> address and size matter. And
>>> neither do we need to interfere with the
>>> command submission of the
>>> process under debug.
>>>
>>>
>>> As far as I can see that allows for the
>>> same functionality as the eudebug
>>> interface, just without any driver
>>> specific code messing with ptrace
>>> permissions and peek/poke interfaces.
>>>
>>> So the question is still why do you
>>> need the whole eudebug interface in the
>>> first place? I might be missing
>>> something, but that seems to be superfluous
>>> from a high level view.
>>>
>>> Recapping from above. It is to allow the
>>> debugging of EU threads per DRM
>>> client, completely independent of the CPU
>>> process. If ptrace_may_access
>>> is the sore point, we could consider other
>>> permission checks, too. There
>>> is no other connection to ptrace in this
>>> architecture than the single
>>> permission check to know if the PID is fair
>>> game for the debugger
>>> process to access.
>>>
>>> Why no parasitic thread or ptrace: Going
>>> forward, binding the EU debugging to
>>> the DRM client would also pave the way for
>>> being able to extend the core-kernel-generated
>>> core dump with each DRM client's EU
>>> thread/memory dump. We have a similar
>>> feature called "Offline core dump" enabled
>>> in the downstream public
>>> trees for i915, where we currently attach
>>> the EU thread dump to the i915 error state
>>> and then later combine the i915 error state
>>> with the CPU core dump file using a
>>> tool.
>>>
>>> This is a relatively small amount of extra
>>> code, as this baseline series
>>> already gives GDB the ability to
>>> perform the necessary actions.
>>> It's just a matter of the kernel driver
>>> calling "stop all threads", then
>>> copying the memory map and memory contents
>>> for GPU threads, just like is
>>> done for CPU threads.
>>>
>>> With parasitic thread injection, I'm not sure
>>> there is such a way forward,
>>> as it would seem to require injecting quite
>>> a bit more logic into the core kernel.
>>>
>>>
>>> It's true that the AMD KFD part still has
>>> similar functionality, but that is
>>> because of the broken KFD design of
>>> tying driver state to the CPU process
>>> (which makes it inaccessible to gdb
>>> even with an imported render node fd).
>>>
>>> Both Sima and I (and partially Dave as
>>> well) have pushed back on the KFD
>>> approach. And the long term plan is to
>>> get rid of such device-driver-specific
>>> interfaces which re-implement existing
>>> functionality just differently.
>>>
>>> Recapping, this series is not adding it
>>> back. The debugger connection
>>> is a separate FD from the DRM one, with a
>>> separate IOCTL set. We don't allow
>>> the DRM FD any new operations based on whether
>>> ptrace is attached or not. We
>>> don't even ever do that check.
>>>
>>> We only restrict the opening of the
>>> debugger connection to a given PID with a
>>> ptrace_may_access check for now. That can
>>> be changed to something else,
>>> if necessary.
>>>
>>> Yeah I think unnecessarily tying gpu processes
>>> to cpu processes is a bad
>>> thing, not least because even today all the svm
>>> discussions we have still hit
>>> clear use-cases where a 1:1 match is not
>>> wanted (like multiple gpu svm
>>> sections with offsets). Not even speaking of
>>> all the gpu use cases where
>>> the gpu vm space is still entirely independent
>>> of the cpu side.
>>>
>>> So that's why I think this entirely separate
>>> approach looks like the right
>>> one, with ptrace_may_access as the access
>>> control check to make sure we
>>> match ptrace on the cpu side.
>>>
>>> But there's very obviously a bikeshed to be had
>>> on what the actual uapi
>>> should look like, especially how gdb opens up a
>>> gpu debug access fd. But I
>>> also think that's not really for drm to decide,
>>> but rather whatever gdb wants. And
>>> then we aim for some consistency on that
>>> lookup/access control part
>>> (ideally, I might be missing some reasons why
>>> this is a bad idea) across
>>> drm drivers.
>>>
>>>
>>> So you need to have a really really
>>> good explanation why the eudebug interface
>>> is actually necessary.
>>>
>>> TL;DR The main point is to decouple the
>>> debugging of the EU workloads from the
>>> debugging of the CPU process. This avoids
>>> interfering with the CPU process via
>>> parasitic thread injection. Further, this
>>> also allows generating a core dump
>>> without any GDB connected. There are also
>>> many other smaller pros/cons
>>> which can be discussed but for the context
>>> of this patch, this is the
>>> main one.
>>>
>>> So unlike parasitic thread injection, we
>>> don't unlock any special IOCTL for
>>> the process under debug to be performed by
>>> the parasitic thread, but we
>>> allow the minimal set of operations to be
>>> performed by GDB as if those were
>>> done on the EUs themselves.
>>>
>>> One can think of it like the minimal subset
>>> of ptrace but for EU threads,
>>> not the CPU threads. And thus, building on
>>> this, it's possible to extend
>>> the core-kernel-generated core dumps with a
>>> DRM-specific extension which
>>> would contain the EU thread/memory dump.
>>>
>>> It might be good to document (in that debugging
>>> doc patch probably) why
>>> thread injection is not a great option, and why
>>> the tradeoffs for
>>> debugging are different than for for
>>> checkpoint/restore, where with CRIU
>>> we landed on doing most of this in userspace,
>>> and often requiring
>>> injection threads to make it all work.
>>>
>>> Cheers, Sima
>>>
>>>
>>> Regards, Joonas
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>>
>>> Matt
>>>
>>>
>>> [3] https://patchwork.freedesktop.org/patch/622520/?series=140200&rev=6
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>> Matt
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>>