<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
On 10.12.24 at 12:57, Joonas Lahtinen wrote:<br>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">Quoting Christian König (2024-12-10 12:00:48)
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">On 10.12.24 at 10:33, Joonas Lahtinen wrote:
Quoting Christian König (2024-12-09 17:42:32)
On 09.12.24 at 16:31, Simona Vetter wrote:
On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote:
On 09.12.24 at 14:33, Mika Kuoppala wrote:
From: Andrzej Hajda <a class="moz-txt-link-rfc2396E" href="mailto:andrzej.hajda@intel.com">&lt;andrzej.hajda@intel.com&gt;</a>
Debugger needs to read/write program's vmas including userptr_vma.
Since hmm_range_fault is used to pin userptr vmas, it is possible
to map those vmas from debugger context.
Oh, this implementation is extremely questionable as well. Adding the LKML
and the MM list as well.
First of all hmm_range_fault() does *not* pin anything!
In other words you don't have a page reference when the function returns,
but rather just a sequence number you can check for modifications.
I think it's all there, holds the invalidation lock during the critical
access/section, drops it when reacquiring pages, retries until it works.
I think the issue is more that everyone hand-rolls userptr.
Well that is part of the issue.
The general problem here is that the eudebug interface tries to simulate
the memory accesses as they would have happened by the hardware.
Could you elaborate, what is that a problem in that, exactly?
It's pretty much the equivalent of ptrace() poke/peek but for GPU memory.
Exactly that here. You try to debug the GPU without taking control of the CPU
process.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
You seem to have a built-in expectation that the CPU threads and memory space
must be interfered with in order to debug a completely different set of threads
and memory space elsewhere that executes independently. I don't quite see why?</pre>
</blockquote>
<br>
Because the GPU only gets the information it needs to execute the
commands.<br>
<br>
A simple example would be single-stepping through the high-level
shader code. That is usually not available to the GPU, but only to
the application that submitted the work.<br>
<br>
The GPU only sees the compiler's output, i.e. the high-level code
lowered to low-level assembler.<br>
<br>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">In debugging massively parallel workloads, it's a huge drawback to be limited to
stop all mode in GDB. If ROCm folks are fine with such limitation, I have nothing
against them keeping that limitation. Just it was a starting design principle for
this design to avoid such a limitation.</pre>
</blockquote>
<br>
Well, that's the part I don't understand. Why is that a drawback?<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">This means that you have to re-implement all debug functionalities which were
previously invested for the CPU process for the GPU once more.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Seems like a strawman argument. Can you list the "all interfaces" being added
that would be possible via indirection via ptrace() beyond peek/poke?
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">And that in turn creates a massive attack surface for security related
problems, especially when you start messing with things like userptrs which
have a very low level interaction with core memory management.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Again, just seems like a strawman argument. You seem to generalize to some massive
attack surface of hypothetical interfaces which you don't list. We're talking
about memory peek/poke here.</pre>
</blockquote>
<br>
That peek/poke interface is more than enough to cause problems.<br>
<br>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">Can you explain the high-level difference from security perspective for
temporarily pinning userptr pages to write them to page tables for GPU to
execute a dma-fence workload with and temporarily pinning pages for
peek/poke?</pre>
</blockquote>
<br>
If you want to access userptr-imported pages from the GPU, going
through the hoops of using hmm_range_fault()/get_user_pages() plus an
MMU notifier is a must-have.<br>
<br>
For a CPU based debugging interface that isn't necessary, you can
just look directly into the application address space with existing
interfaces.<br>
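<br>
A minimal sketch of one such existing interface, assuming Linux and
using /proc/PID/mem (which applies the same access checks as a ptrace
attach). The helper name peek() and the demo buffer are made up for
illustration; the demo reads the calling process itself so that no
extra privileges are needed:<br>
<pre>
```python
import ctypes
import os


def peek(pid: int, address: int, length: int) -> bytes:
    """Read `length` bytes at virtual `address` from process `pid`.

    Goes through /proc/PID/mem; reading another process normally
    requires permission to ptrace-attach to it.
    """
    with open(f"/proc/{pid}/mem", "rb", buffering=0) as mem:
        mem.seek(address)
        return mem.read(length)


# Stand-in for memory that would otherwise be imported as a userptr.
payload = ctypes.create_string_buffer(b"userptr-backed data")

# Read it back through the process address space, not through a driver.
data = peek(os.getpid(), ctypes.addressof(payload), len(payload.value))
print(data)
```
</pre>
The point of the sketch is only that the debugger can reach such
memory through the owning process, without the GPU driver being
involved.<br>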
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> And it is exactly the kind of interface that makes sense for debugger as
GPU memory != CPU memory, and they don't need to align at all.
And that is what I strongly disagree on. When you debug the GPU it is mandatory
to gain control of the CPU process as well.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
You are free to disagree on that. I simply don't agree and have in this
and previous email presented multiple reasons as to why not. We can
agree to disagree on the topic.</pre>
</blockquote>
<br>
Yeah, that's ok. I also think we can agree that this doesn't
matter for the discussion.<br>
<br>
The question is rather whether the userptr functionality should be
used for debugging or not.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">The CPU process is basically the overseer of the GPU activity, so it should
know everything about the GPU operation, for example what a mapping actually
means.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
How does that relate to what is being discussed here? You just seem to
explain how you think userspace driver should work: Maintain a shadow
tree of each ppGTT VM layout? I don't agree on that, but I think it is
slightly irrelevant here.</pre>
</blockquote>
<br>
I'm trying to understand why you want to debug only the GPU without
also attaching to the CPU process.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">The kernel driver and the hardware only have the information necessary to
execute the work prepared by the CPU process. So the information available is
limited to begin with.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
And the point here is? Are you saying kernel does not know the actual mappings
maintained in the GPU page tables?</pre>
</blockquote>
<br>
The kernel knows that; the question is why userspace doesn't know
that.<br>
<br>
On the other hand I have to agree that this isn't much of a problem.<br>
<br>
If userspace really doesn't know what is mapped where in the GPU's
VM address space then an IOCTL to query that is probably not an
issue.<br>
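<br>
A rough sketch of how a debugger could use the result of such a query,
assuming it comes back as a list of ranges. No such IOCTL is defined
here; the Mapping layout, the gpu_to_cpu() helper and all addresses
are hypothetical:<br>
<pre>
```python
from typing import NamedTuple, Optional


class Mapping(NamedTuple):
    """One entry of a hypothetical 'what is mapped where' query result."""
    gpu_start: int
    size: int
    cpu_start: Optional[int]  # None when the range has no CPU mapping


def gpu_to_cpu(mappings: list[Mapping], gpu_addr: int) -> Optional[int]:
    """Translate a GPU virtual address to the CPU address backing it, if any."""
    for m in mappings:
        if gpu_addr in range(m.gpu_start, m.gpu_start + m.size):
            if m.cpu_start is None:
                return None
            return m.cpu_start + (gpu_addr - m.gpu_start)
    return None


# Made-up example data: one userptr-style range and one GPU-only range.
table = [
    Mapping(gpu_start=0x10000, size=0x1000, cpu_start=0x7F0000000000),
    Mapping(gpu_start=0x20000, size=0x2000, cpu_start=None),
]
print(gpu_to_cpu(table, 0x10010))
```
</pre>
Ranges without a CPU mapping return None, so those would still need a
different access path.<br>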
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> What the debugger should probably do is to cleanly attach to the
application, get the information which CPU address is mapped to which
GPU address and then use the standard ptrace interfaces.
I don't quite agree here -- at all. "Which CPU address is mapped to
which GPU address" makes no sense when the GPU address space and CPU
address space is completely controlled by the userspace driver/application.
Yeah, that's the reason why you should ask the userspace driver/application for
the necessary information and not go over the kernel to debug things.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
What hypothetical necessary information are you referring to exactly?</pre>
</blockquote>
<br>
What you said before: "<span style="white-space: pre-wrap">the GPU address space and CPU </span><span style="white-space: pre-wrap">address space is completely controlled by the userspace driver/application".
When that's the case, why ask the kernel for help? The driver/application is in control.
</span>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">I already explained there are good reasons not to map all the GPU memory
into the CPU address space.</pre>
</blockquote>
<br>
Well, I still don't fully agree with that argument, but compared
to using userptr, the peek/poke on a GEM handle is basically
harmless.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> Please try to consider things outside of the ROCm architecture.
Well I consider a good part of the ROCm architecture rather broken exactly
because we haven't pushed back hard enough on bad ideas.
Something like a register scratch region or EU instructions should not
even be mapped to CPU address space as CPU has no business accessing it
during normal operation. And backing of such region will vary per
context/LRC on the same virtual address per EU thread.
You seem to be suggesting to rewrite even our userspace driver to behave
the same way as ROCm driver does just so that we could implement debug memory
accesses via ptrace() to the CPU address space.
Oh, well certainly not. That ROCm has an 1 to 1 mapping between CPU and GPU is
one thing I've pushed back massively on and has now proven to be problematic.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Right, so is your claim then that instead of being 1:1 the CPU address space
should be a superset of all GPU address spaces instead to make sure
ptrace() can modify all memory?</pre>
</blockquote>
<br>
Well why not? Mapping a BO and not accessing it has only minimal
overhead.<br>
<br>
We already considered making that mandatory for TTM drivers for
better OOM killer handling. That approach was discontinued, but
certainly not because of the overhead.<br>
<br>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">Cause I'm slightly lost here as you don't give much reasoning, just
claim things to be certain way.</pre>
</blockquote>
<br>
Ok, that's certainly not what I'm trying to express.<br>
<br>
Things don't need to be in a certain way, especially not in the way
ROCm does things.<br>
<br>
But you should not try to re-create GPU accesses with the CPU,
especially when that isn't memory you have control over in the sense
that it was allocated through your driver stack.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> That seems bit of a radical suggestion, especially given the drawbacks
pointed out in your suggested design.
The whole interface re-invents a lot of functionality which is already
there
I'm not really sure I would call adding a single interface for memory
reading and writing to be "re-inventing a lot of functionality".
All the functionality behind this interface will be needed by GPU core
dumping, anyway. Just like for the other patch series.
As far as I can see exactly that's an absolutely no-go. Device core dumping
should *never ever* touch memory imported by userptrs.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Could you again elaborate on what the great difference is to short term
pinning to use in dma-fence workloads? Just the kmap?</pre>
</blockquote>
<br>
The big difference is that the memory doesn't belong to the driver
that is doing the core dump.<br>
<br>
That is just something you have imported from the MM subsystem, e.g.
anonymous memory and file backed mappings.<br>
<br>
We also don't allow mmap() of dma-bufs on importing devices for
similar reasons.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">That's what process core dumping is good for.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Not really sure I agree. If you do not dump the memory as seen by the
GPU, then you need to go parsing the CPU address space in order to make
sense which buffers were mapped where and that CPU memory contents containing
metadata could be corrupt as we're dealing with a crashing app to begin with.
Big point of relying to the information from GPU VM for the GPU memory layout
is that it won't be corrupted by rogue memory accesses in CPU process.</pre>
</blockquote>
<br>
Well, that you don't want to use potentially corrupted information is
a good argument, but why not just dump information like "range
0xabcd-0xbcde came as userptr from process 1 VMA 0x1234-0x5678"?<br>
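<br>
A minimal sketch of such a dump entry, assuming the dump only records
where a range came from and never copies its contents. The
UserptrRecord type and its field names are hypothetical:<br>
<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserptrRecord:
    """Hypothetical core-dump entry naming a userptr range without copying it."""
    gpu_start: int
    gpu_end: int
    pid: int
    vma_start: int
    vma_end: int

    def describe(self) -> str:
        return (
            f"range {self.gpu_start:#x}-{self.gpu_end:#x} came as userptr "
            f"from process {self.pid} VMA {self.vma_start:#x}-{self.vma_end:#x}"
        )


record = UserptrRecord(0xABCD, 0xBCDE, 1, 0x1234, 0x5678)
print(record.describe())
```
</pre>
A tool reading the dump could then fetch the actual contents from the
process core dump instead.<br>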
<br>
A process address space is not really something a device driver
should be messing with.<br>
<br>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> just because you don't like the idea to attach to the debugged
application in userspace.
A few points that have been brought up as drawback to the
GPU debug through ptrace(), but to recap a few relevant ones for this
discussion:
- You can only really support GDB stop-all mode or at least have to
stop all the CPU threads while you control the GPU threads to
avoid interference. Elaborated on this on the other threads more.
- Controlling the GPU threads will always interfere with CPU threads.
Doesn't seem feasible to single-step an EU thread while CPU threads
continue to run freely?
I would say no.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Should this be understood that you agree these are limitations of the ROCm
debug architecture?</pre>
</blockquote>
<br>
ROCm has a bunch of design decisions I would say we should never
ever repeat:<br>
<br>
1. Forcing a 1 to 1 model between GPU address space and CPU address
space.<br>
<br>
2. Using a separate file descriptor in addition to the DRM render
node.<br>
<br>
3. Attaching information and context to the CPU process instead of
the DRM render node.<br>
....<br>
<br>
But stopping the world, i.e. both CPU and GPU threads, when you want
to debug something is not one of the problematic decisions.<br>
<br>
That's why I'm really surprised that you insist so much on that.<br>
<br>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> - You are very much restricted by the CPU VA ~ GPU VA alignment
requirement, which is not true for OpenGL or Vulkan etc. Seems
like one of the reasons why ROCm debugging is not easily extendable
outside compute?
Well as long as you can't take debugged threads from the hardware you can
pretty much forget any OpenGL or Vulkan debugging with this interface since it
violates the dma_fence restrictions in the kernel.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Agreed. However, just because you can't do it right now doesn't mean you should
design an architecture that actively prevents you from doing that in the future.</pre>
</blockquote>
<br>
Good point. That's something I can totally agree on as well.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""> - You have to expose extra memory to CPU process just for GPU
debugger access and keep track of GPU VA for each. Makes the GPU more
prone to OOB writes from CPU. Exactly what not mapping the memory
to CPU tried to protect the GPU from to begin with.
As far as I can see this whole idea is extremely questionable. This
looks like re-inventing the wheel in a different color.
I see it like reinventing a round wheel compared to octagonal wheel.
Could you elaborate with facts much more on your position why the ROCm
debugger design is an absolute must for others to adopt?
Well I'm trying to prevent some of the mistakes we did with the ROCm design.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Well, I would say that the above limitations are direct results of the ROCm
debugging design. So while we're eager to learn about how you perceive
GPU debugging should work, would you mind addressing the above
shortcomings?</pre>
</blockquote>
<br>
Yeah, absolutely. That you don't have a 1 to 1 mapping on the GPU is
a step in the right direction if you ask me.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">And trying to re-invent well proven kernel interfaces is one of the big
mistakes made in the ROCm design.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Appreciate the feedback. Please work on the representation a bit as it currently
doesn't seem very helpful but appears just as an attempt to try to throw a spanner
in the works.
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">If you really want to expose an interface to userspace
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
To a debugger process, enabled only behind a flag.
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">which walks the process
page table, installs an MMU notifier
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
This part is already done to put an userptr to the GPU page tables to
begin with. So hopefully not too controversial.
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">kmaps the resulting page
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
In addition to having it in the page tables where GPU can access it.
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">and then memcpy
to/from it then you absolutely *must* run that by guys like Christoph Hellwig,
Andrew and even Linus.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Surely, that is why we're seeking out for review.
We could also in theory use an in-kernel GPU context on the GPU hardware for
doing the peek/poke operations on userptr.</pre>
</blockquote>
<br>
Yeah, I mean that should clearly work out. We have something
similar.<br>
<br>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">But that seems like a high-overhead thing to do due to the overhead of
setting up a transfer per data word and going over the PCI bus twice
compared to accessing the memory directly by CPU when it trivially can.</pre>
</blockquote>
<br>
Understandable, but that will create another way of accessing
process memory.<br>
<br>
Regards,<br>
Christian.<br>
<br>
<blockquote type="cite" cite="mid:173383187817.17709.7100544929981970614@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">
So this is the current proposal.
Regards, Joonas
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">
I'm pretty sure that those guys will note that a device driver should
absolutely not mess with such stuff.
Regards,
Christian.
Otherwise it just looks like you are trying to prevent others from
implementing a more flexible debugging interface through vague comments about
"questionable design" without going into details. Not listing much concrete
benefits nor addressing the very concretely expressed drawbacks of your
suggested design, makes it seem like a very biased non-technical discussion.
So while review interest and any comments are very much appreciated, please
also work on providing bit more reasoning and facts instead of just claiming
things. That'll help make the discussion much more fruitful.
Regards, Joonas
</pre>
</blockquote>
</blockquote>
<br>
</body>
</html>