[PATCH v6 2/8] drm/ttm: Add ttm_bo_access
Christian König
christian.koenig at amd.com
Tue Nov 12 16:25:24 UTC 2024
On 12.11.24 17:22, Thomas Hellström wrote:
> On Tue, 2024-11-12 at 15:41 +0200, Joonas Lahtinen wrote:
>> (+ Thomas)
>>
>> Quoting Christian König (2024-11-12 11:23:36)
>>> On 11.11.24 23:45, Matthew Brost wrote:
>>>
>>> [SNIP]
>>>
>>> So I think the only way to allow interactive debugging is to avoid
>>> the dma_fences. Curious to hear if there are ideas otherwise.
>>>
>>> You need to guarantee somehow that the process is taken off the
>>> hardware so that the preemption fence can signal.
>>>
>>>
>>> Our preemption fences have this functionality.
>>>
>>> A preemption fence issues a suspend execution command to the
>>> firmware. The
>>> firmware, in turn, attempts to preempt the workload. If it
>>> doesn't respond
>>> within a specified period, it resets the hardware queue, sends
>>> a message to KMD,
>>> bans the software queue, and signals the preemption fence.
>>>
>>> We provide even more protection than that. If, for some reason,
>>> the firmware
>>> doesn't respond within a longer timeout period, the KMD
>>> performs a device reset,
>>> bans the offending software queue(s), and signals the
>>> preemption fences.
>>>
>>> This flow remains the same whether a debugger is attached or,
>>> for example, a
>>> user submits a 10-minute non-preemptable workload. In either
>>> case, other
>>> processes are guaranteed to make forward progress.
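>>>
>>> Roughly, in pseudo-driver form (fw_suspend_queue(), hw_reset_queue_and_ban()
>>> and struct hw_queue below are placeholder names, not actual Xe code):
>>>
>>>   #include <linux/dma-fence.h>
>>>   #include <linux/spinlock.h>
>>>   #include <linux/workqueue.h>
>>>
>>>   struct hw_queue;                                     /* placeholder */
>>>   int fw_suspend_queue(struct hw_queue *q, unsigned int timeout_ms);
>>>   void hw_reset_queue_and_ban(struct hw_queue *q);
>>>
>>>   struct preempt_fence {
>>>           struct dma_fence base;
>>>           spinlock_t lock;          /* passed to dma_fence_init() */
>>>           struct work_struct work;
>>>           struct hw_queue *q;
>>>   };
>>>
>>>   static void preempt_fence_work(struct work_struct *w)
>>>   {
>>>           struct preempt_fence *pf = container_of(w, struct preempt_fence, work);
>>>
>>>           /* Ask the firmware to suspend the queue; on timeout (non-zero
>>>            * return) reset the hardware queue and ban the software queue. */
>>>           if (fw_suspend_queue(pf->q, 1000))
>>>                   hw_reset_queue_and_ban(pf->q);
>>>
>>>           /* Either way the fence signals, so waiters always make
>>>            * forward progress. */
>>>           dma_fence_signal(&pf->base);
>>>   }
>>>
>>>   static bool preempt_fence_enable_signaling(struct dma_fence *fence)
>>>   {
>>>           struct preempt_fence *pf = container_of(fence, struct preempt_fence, base);
>>>
>>>           queue_work(system_unbound_wq, &pf->work);
>>>           return true;
>>>   }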
>>>
>>>
>>> Yeah, that is pretty much the same argumentation I have heard before,
>>> and it turned out not to work.
>>>
>>>
>>> The example above illustrates the memory oversubscription case,
>>> where two processes are each using 51% of the memory.
>>>
>>>
>>> That isn't even necessary. We have seen applications dying just
>>> because the core memory management tried to join back small pages
>>> into huge pages in a userptr.
>>>
>>> The core memory management jumping in and requesting that the
>>> pre-emption fence signal can happen at any time.
>> Ouch. Does there happen to be a known reproducer for this behavior,
>> or maybe a bug report?
>>
>>> You can mitigate that a bit; Fedora, for example, disables joining
>>> back small pages into huge pages by default, and we even had people
>>> suggesting to use mprotect() so that userptr VMAs don't fork() any
>>> more (which is of course completely illegal).
>>>
>>> But my long-term takeaway is that you can't block all causes of
>>> sudden requests to let a pre-emption fence signal.
>> I think this problem equally applies to the LR-workloads like the EU
>> debugging ones.
>>
>>> Another preemption scenario involves two processes sharing
>>> hardware resources.
>>> Our firmware follows the same flow here. If an LR workload is
>>> using a hardware
>>> resource and a DMA-fence workload is waiting, and if the LR
>>> workload doesn't
>>> preempt in a timely manner, the firmware issues a hardware
>>> reset, notifies
>>> KMD, and bans the LR software queue. The DMA-fence workload
>>> can then make
>>> forward progress.
>>>
>>> With the above in mind, this is why I say that if a user tries
>>> to run a game and
>>> a non-preemptable LR workload, either oversubscribing memory or
>>> sharing hardware
>>> resources, it is unlikely to work well. However, I don't think
>>> this is a common
>>> use case. I would expect that when a debugger is open, it is
>>> typically by a
>>> power user who knows how to disable other GPU tasks (e.g., by
>>> enabling software
>>> rendering or using a machine without any display).
>>>
>>> Given this, please reconsider your position.
>>>
>>>
>>> The key point here is that this isn't stable; you can do that as a
>>> tech demo, but it can always happen that an application being
>>> debugged just randomly dies. And believe me, AMD has tried this to a
>>> rather extreme extent as well.
>> It's not really limited to the debuggable applications at all; the
>> normal LR workloads are equally impacted as far as I understand. It is
>> just harder to catch the issue with LR workloads if the pre-emption
>> fence signaling is sporadic.
>>
>>> What could potentially work is to taint the kernel and make sure that
>>> this functionality is only available to users who absolutely know
>>> what they are doing.
>>>
>>> But I would say we can only allow that if all other options have
>>> been exhausted and doing it like this is really the only option left.
>> It sounds like servicing the memory pre-empt fence by stealing the
>> pages from underneath the workload would be the way to resolve this
>> issue.
>>
>> This has been extensively discussed already, but was expected to
>> really
>> only be needed for low-on-memory scenarios. However, it now seems the
>> need arises much earlier due to the random userptr page joining by the
>> core mm.
> Just to clarify here:
>
> In Long-Running mode with recoverable pagefaults enabled we don't have
> any preempt-fences, but rather just zap the PTEs pointing to the
> affected memory and flush the TLB. So from a memory resource POV a
> breakpoint should be safe, and no mmu notifier nor shrinker will be
> blocked.
That sounds like an HMM-based approach, which would clearly work.
But where is that? I don't see any HMM-based approach anywhere in the Xe
driver.
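
What Thomas describes sounds like the usual mmu_interval_notifier
pattern, i.e. something along these lines (placeholder helpers, not
actual Xe code):

  #include <linux/mmu_notifier.h>

  struct my_gpu_vm;                                      /* placeholder */
  void my_vm_zap_range(struct my_gpu_vm *vm, unsigned long start,
                       unsigned long end);
  void my_vm_flush_tlb(struct my_gpu_vm *vm);

  struct my_userptr {
          struct mmu_interval_notifier notifier;
          struct my_gpu_vm *vm;
  };

  static bool my_userptr_invalidate(struct mmu_interval_notifier *mni,
                                    const struct mmu_notifier_range *range,
                                    unsigned long cur_seq)
  {
          struct my_userptr *uptr = container_of(mni, struct my_userptr,
                                                 notifier);

          mmu_interval_set_seq(mni, cur_seq);

          /* Zap the GPU PTEs covering the invalidated range and flush the
           * TLB; the next recoverable GPU page fault repopulates them, so
           * no dma_fence ever needs to signal here (the seq/retry locking
           * of the fault path is omitted for brevity). */
          my_vm_zap_range(uptr->vm, range->start, range->end);
          my_vm_flush_tlb(uptr->vm);

          return true;
  }

  static const struct mmu_interval_notifier_ops my_userptr_notifier_ops = {
          .invalidate = my_userptr_invalidate,
  };
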
Regards,
Christian.
>
> Nor will there be any jobs with published dma-fences depending on the
> job blocked either temporarily by a pagefault or long-term by a
> debugger breakpoint.
>
> /Thomas
>
>
>> If that is done and the memory pre-empt fence is serviced even for
>> debuggable contexts, do you have further concerns with the presented
>> approach
>> from a dma-buf and drm/sched perspective?
>>
>> Regards, Joonas
>>
>>> Regards,
>>> Christian.
>>>
>>>
>>> This means that a breakpoint or core dump doesn't halt GPU
>>> threads, but
>>> rather suspends them. E.g. all running wave data is
>>> collected into a state
>>> bag which can be restored later on.
>>>
>>> I was under the impression that those long running compute
>>> threads do
>>> exactly that, but if the hardware can't switch out the
>>> GPU thread/process
>>> while at a breakpoint, then that isn't the case.
>>>
>>> As long as you don't find a way to avoid that, this patch
>>> set is a pretty
>>> clear NAK from my side as DMA-buf and TTM maintainer.
>>>
>>>
>>> I believe this is addressed above.
>>>
>>> Matt
>>>
>>>
>>> What might work is to keep the submission on the hardware
>>> in the break state
>>> but forbid any memory access. This way you can signal your
>>> preemption fence
>>> even when the hardware isn't made available.
>>>
>>> Before you continue, Xe sets up a new pre-emption fence and
>>> makes sure that
>>> all page tables etc. are up to date.
>>>
>>> Could be tricky to get this right if completion fence based
>>> submissions are
>>> mixed in as well, but that gives you at least a direction
>>> you could
>>> potentially go.
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>> Regards, Joonas
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>> Some wash-up thoughts from me below, but
>>> consider them fairly irrelevant
>>> since I think the main driver for these big
>>> questions here should be
>>> gdb/userspace.
>>>
>>>
>>> Quoting Christian König (2024-11-07
>>> 11:44:33)
>>>
>>> On 06.11.24 18:00, Matthew Brost wrote:
>>>
>>> [SNIP]
>>>
>>> This is not a generic interface
>>> that anyone can freely access. The same
>>> permissions used by ptrace are
>>> checked when opening such an interface.
>>> See [1] [2].
>>>
>>>
>>> [1] https://patchwork.freedesktop.org/patch/617470/?series=136572&rev=2
>>>
>>> [2] https://patchwork.freedesktop.org/patch/617471/?series=136572&rev=2
>>>
>>>
>>> Thanks a lot for those pointers, that
>>> is exactly what I was looking for.
>>>
>>> And yeah, it is what I feared. You are
>>> re-implementing existing functionality,
>>> but see below.
>>>
>>> Could you elaborate on what this "existing
>>> functionality" exactly is?
>>> I do not think this functionality exists at
>>> this time.
>>>
>>> The EU debugging architecture for Xe
>>> specifically avoids the need for GDB
>>> to attach with ptrace to the CPU process or
>>> to interfere with the CPU process for
>>> debugging via parasitic threads or the like.
>>>
>>> A debugger connection is opened to the DRM
>>> driver for a given PID (which uses the
>>> ptrace may-access check for now), after
>>> which all DRM clients of that
>>> PID are exposed to the debugger process.
>>>
>>> What we want to expose via that debugger
>>> connection is the ability for GDB to
>>> read/write the different GPU VM address
>>> spaces (ppGTT for Intel GPUs) just like
>>> the EU threads would see them. Note that
>>> the layout of the ppGTT is
>>> completely up to the userspace driver to
>>> set up and usually only partially
>>> matches the CPU address space.
>>>
>>> Specifically as part of reading/writing the
>>> ppGTT for debugging purposes,
>>> there are deep flushes needed: for example
>>> flushing the instruction cache
>>> when adding/removing breakpoints.
>>>
>>> Maybe that will explain the background. I
>>> elaborate on this at the end some more.
>>>
>>>
>>> kmap/vmap are used
>>> everywhere in the DRM subsystem to access BOs, so I’m
>>> failing to see the
>>> problem with adding a simple helper based on existing
>>> code.
>>>
>>> What's possible and often
>>> done is to do kmap/vmap if you need to implement a
>>> CPU copy for scanout for
>>> example or for copying/validating command buffers.
>>> But that usually requires
>>> accessing the whole BO and has separate security
>>> checks.
>>>
>>> When you want to access only
>>> a few bytes of a BO, that sounds very much like
>>> a peek/poke-like interface,
>>> and we have already rejected that more than once.
>>> There even used to be
>>> standardized GEM IOCTLs for that which have been
>>> removed by now.
>>>
>>> Referring to the explanation at the top: these
>>> IOCTLs are not for the debugging target
>>> process to issue. The peek/poke interface
>>> is specifically for GDB only,
>>> to facilitate the emulation of memory
>>> reads/writes on the GPU address
>>> space as if they were done by the EUs themselves.
>>> And to recap: for modifying
>>> instructions, for example (adding/removing a
>>> breakpoint), an extra level of cache flushing is
>>> needed which is not available to regular
>>> userspace.
>>>
>>> I specifically discussed the difference with
>>> Sima before originally moving forward with this
>>> design. If something has changed
>>> since then, I'm of course happy to rediscuss.
>>>
>>> However, if this code can't be added, I'm not
>>> sure how we would ever be able
>>> to implement core dumps for GPU
>>> threads/memory.
>>>
>>>
>>> If you need to access BOs
>>> which are placed in memory that is not CPU accessible, then
>>> implement the access callback
>>> for ptrace; see amdgpu_ttm_access_memory for
>>> an example of how to do this.
>>>
>>> As also mentioned above, we don't work via
>>> ptrace at all when it comes
>>> to debugging the EUs. The only thing used
>>> for now is ptrace_may_access to
>>> implement similar access restrictions as
>>> ptrace has. This can be changed
>>> to something else if needed.
>>>
>>>
>>> Ptrace access via
>>> vm_operations_struct.access → ttm_bo_vm_access.
>>>
>>> This series renames
>>> ttm_bo_vm_access to ttm_bo_access, with no code changes.
>>>
>>> The above function accesses a BO
>>> via kmap if it is in SYSTEM / TT,
>>> which is existing code.
>>>
>>> This function is only exposed to
>>> user space via ptrace permissions.
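>>>
>>> For reference, on the driver side that path is just the standard TTM
>>> helpers wired into the mmap ops, roughly (the struct name below is a
>>> placeholder):
>>>
>>>   static const struct vm_operations_struct my_gem_vm_ops = {
>>>           .fault  = ttm_bo_vm_fault,
>>>           .open   = ttm_bo_vm_open,
>>>           .close  = ttm_bo_vm_close,
>>>           /* reached from ptrace()/access_process_vm() on a GPU mmap */
>>>           .access = ttm_bo_vm_access,
>>>   };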
>>>
>>> Maybe this sentence is what caused the
>>> confusion.
>>>
>>> Userspace is never exposed to a peek/poke
>>> interface, only the debugger
>>> connection, which is its own FD.
>>>
>>>
>>> In this series, we implement a
>>> function [3] similar to
>>> amdgpu_ttm_access_memory for the
>>> TTM vfunc access_memory. What is
>>> missing is access to non-CPU-visible memory,
>>> similar to
>>> amdgpu_ttm_access_memory_sdma.
>>> This will be addressed in a follow-up and
>>> was omitted in this series given
>>> its complexity.
>>>
>>> So, this looks more or less
>>> identical to AMD's ptrace implementation,
>>> but in GPU address space. Again,
>>> I fail to see what the problem is here.
>>> What am I missing?
>>>
>>>
>>> The main question is why can't you use
>>> the existing interfaces directly?
>>>
>>> We're not working on the CPU address space
>>> or BOs. We're working
>>> strictly on the GPU address space as it would
>>> be seen by an EU thread if it
>>> accessed address X.
>>>
>>>
>>> In addition to the peek/poke interface
>>> of ptrace, Linux has the pidfd_getfd
>>> system call, see
>>> here: https://man7.org/linux/man-pages/man2/pidfd_getfd.2.html.
>>>
>>> pidfd_getfd() allows you to dup() the
>>> render node file descriptor into your gdb
>>> process. That in turn gives you all the
>>> access you need from gdb, including
>>> mapping BOs and command submission on
>>> behalf of the application.
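>>>
>>> From the gdb side that is roughly the following (raw syscalls, since
>>> older glibc has no wrappers; finding the target's fd number, e.g. via
>>> /proc/<pid>/fd, and error handling are left out):
>>>
>>>   #define _GNU_SOURCE
>>>   #include <sys/syscall.h>
>>>   #include <sys/types.h>
>>>   #include <unistd.h>
>>>
>>>   /* Duplicate the render node fd of the debugged process into gdb;
>>>    * requires PTRACE_MODE_ATTACH_REALCREDS permission over the target. */
>>>   static int grab_render_node(pid_t target, int target_drm_fd)
>>>   {
>>>           int pidfd = syscall(SYS_pidfd_open, target, 0);
>>>
>>>           if (pidfd < 0)
>>>                   return -1;
>>>
>>>           return syscall(SYS_pidfd_getfd, pidfd, target_drm_fd, 0);
>>>   }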
>>>
>>> We're not operating on the CPU address
>>> space nor are we operating on BOs
>>> (there is no concept of BO in the EU debug
>>> interface). Each VMA in the VM
>>> could come from anywhere, only the start
>>> address and size matter. And
>>> neither do we need to interfere with the
>>> command submission of the
>>> process under debug.
>>>
>>>
>>> As far as I can see that allows for the
>>> same functionality as the eudebug
>>> interface, just without any driver
>>> specific code messing with ptrace
>>> permissions and peek/poke interfaces.
>>>
>>> So the question is still why do you
>>> need the whole eudebug interface in the
>>> first place? I might be missing
>>> something, but that seems to be superfluous
>>> from a high level view.
>>>
>>> Recapping from above. It is to allow the
>>> debugging of EU threads per DRM
>>> client, completely independent of the CPU
>>> process. If ptrace_may_access
>>> is the sore point, we could consider other
>>> permission checks, too. There
>>> is no other connection to ptrace in this
>>> architecture than the single
>>> permission check to know if the PID is fair
>>> game for the debugger
>>> process to access.
>>>
>>> Why no parasitic thread or ptrace: Going
>>> forward, binding the EU debugging to
>>> the DRM client would also pave the way for
>>> being able to extend the core-kernel-generated
>>> core dump with each DRM client's EU
>>> thread/memory dump. We have a similar
>>> feature called "Offline core dump" enabled
>>> in the downstream public
>>> trees for i915, where we currently attach
>>> the EU thread dump to the i915 error state
>>> and then later combine the i915 error state
>>> with the CPU core dump file using a
>>> tool.
>>>
>>> This is a relatively small amount of extra
>>> code, as this baseline series
>>> already gives GDB the ability to
>>> perform the necessary actions.
>>> It's just a matter of the kernel driver
>>> calling "stop all threads", then
>>> copying the memory map and memory contents
>>> for GPU threads, just like is
>>> done for CPU threads.
>>>
>>> With parasitic thread injection, I'm not sure
>>> there is such a way forward,
>>> as it would seem to require injecting quite
>>> a bit more logic into the core kernel.
>>>
>>>
>>> It's true that the AMD KFD part still has
>>> similar functionality, but that is
>>> because of the broken KFD design of
>>> tying driver state to the CPU process
>>> (which makes it inaccessible to gdb
>>> even with an imported render node fd).
>>>
>>> Both Sima and I (and partially Dave as
>>> well) have pushed back on the KFD
>>> approach. And the long term plan is to
>>> get rid of such device-driver-specific
>>> interfaces which re-implement existing
>>> functionality just differently.
>>>
>>> Recapping, this series is not adding it
>>> back. The debugger connection
>>> is a separate FD from the DRM one, with a
>>> separate IOCTL set. We don't allow
>>> the DRM FD any new operations based on whether
>>> ptrace is attached or not. We
>>> don't even ever do that check.
>>>
>>> We only restrict the opening of the
>>> debugger connection to a given PID with a
>>> ptrace_may_access check for now. That can
>>> be changed to something else,
>>> if necessary.
>>>
>>> Yeah I think unnecessarily tying gpu processes
>>> to cpu processes is a bad
>>> thing, not least because even today all the svm
>>> discussions we have still hit
>>> clear use-cases where a 1:1 match is not
>>> wanted (like multiple gpu svm
>>> sections with offsets). Not even speaking of
>>> all the gpu use cases where
>>> the gpu vm space is still entirely independent
>>> of the cpu side.
>>>
>>> So that's why I think this entirely separate
>>> approach looks like the right
>>> one, with ptrace_may_access as the access
>>> control check to make sure we
>>> match ptrace on the cpu side.
>>>
>>> But there's very obviously a bikeshed to be had
>>> on what the actual uapi
>>> should look like, especially how gdb opens up a
>>> gpu debug access fd. But I
>>> also think that's not really for drm to decide,
>>> but rather whatever gdb wants. And
>>> then we aim for some consistency on that
>>> lookup/access control part
>>> (ideally, I might be missing some reasons why
>>> this is a bad idea) across
>>> drm drivers.
>>>
>>>
>>> So you need to have a really really
>>> good explanation why the eudebug interface
>>> is actually necessary.
>>>
>>> TL;DR The main point is to decouple the
>>> debugging of the EU workloads from the
>>> debugging of the CPU process. This avoids
>>> interfering with the CPU process via
>>> parasitic thread injection. Further, this
>>> also allows generating a core dump
>>> without any GDB connected. There are also
>>> many other smaller pros/cons
>>> which can be discussed but for the context
>>> of this patch, this is the
>>> main one.
>>>
>>> So unlike parasitic thread injection, we
>>> don't unlock any special IOCTL for
>>> the process under debug to be performed by
>>> the parasitic thread, but we
>>> allow the minimal set of operations to be
>>> performed by GDB as if those were
>>> done on the EUs themselves.
>>>
>>> One can think of it like the minimal subset
>>> of ptrace but for EU threads,
>>> not the CPU threads. And thus, building on
>>> this, it's possible to extend
>>> the core-kernel-generated core dumps with a
>>> DRM-specific extension which
>>> would contain the EU thread/memory dump.
>>>
>>> It might be good to document (in that debugging
>>> doc patch probably) why
>>> thread injection is not a great option, and why
>>> the tradeoffs for
>>> debugging are different than for for
>>> checkpoint/restore, where with CRIU
>>> we landed on doing most of this in userspace,
>>> and often requiring
>>> injection threads to make it all work.
>>>
>>> Cheers, Sima
>>>
>>>
>>> Regards, Joonas
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>>
>>> Matt
>>>
>>>
>>> [3] https://patchwork.freedesktop.org/patch/622520/?series=140200&rev=6
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>> Matt
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>>