<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
On 10.12.24 at 10:33, Joonas Lahtinen wrote:<br>
<blockquote type="cite" cite="mid:173382321353.8959.8314520413901294535@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">Quoting Christian König (2024-12-09 17:42:32)
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">On 09.12.24 at 16:31, Simona Vetter wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">On 09.12.24 at 14:33, Mika Kuoppala wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">From: Andrzej Hajda <a class="moz-txt-link-rfc2396E" href="mailto:andrzej.hajda@intel.com"><andrzej.hajda@intel.com></a>
The debugger needs to read/write the program's VMAs, including userptr_vma.
Since hmm_range_fault is used to pin userptr VMAs, it is possible
to map those VMAs from the debugger context.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">Oh, this implementation is extremely questionable as well. Adding LKML
and the MM list.
First of all, hmm_range_fault() does *not* pin anything!
In other words, you don't have a page reference when the function returns,
but rather just a sequence number you can check for modifications.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">I think it's all there: it holds the invalidation lock during the critical
section, drops it when reacquiring pages, and retries until it works.
I think the issue is more that everyone hand-rolls userptr.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Well, that is part of the issue.
The general problem here is that the eudebug interface tries to simulate
memory accesses as they would have been performed by the hardware.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Could you elaborate on what exactly the problem with that is?
It's pretty much the equivalent of ptrace() poke/peek, but for GPU memory.</pre>
</blockquote>
<br>
Exactly that is the problem here: you try to debug the GPU without
taking control of the CPU process.<br>
<br>
This means that you have to re-implement for the GPU all the debug
functionality which was previously implemented for the CPU
process.<br>
<br>
And that in turn creates a massive attack surface for
security-related problems, especially when you start messing with
things like userptrs, which interact with core memory management at
a very low level.<br>
<br>
<blockquote type="cite" cite="mid:173382321353.8959.8314520413901294535@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">And it is exactly the kind of interface that makes sense for a debugger,
as GPU memory != CPU memory, and the two don't need to align at all.</pre>
</blockquote>
<br>
And that is what I strongly disagree with. When you debug the GPU,
it is mandatory to gain control of the CPU process as well.<br>
<br>
The CPU process is basically the overseer of the GPU activity, so it
should know everything about the GPU operation, for example what a
mapping actually means.<br>
<br>
The kernel driver and the hardware only have the information
necessary to execute the work prepared by the CPU process. So the
information available is limited to begin with.<br>
<br>
<blockquote type="cite" cite="mid:173382321353.8959.8314520413901294535@jlahtine-mobl.ger.corp.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">What the debugger should probably do is to cleanly attach to the
application, get the information which CPU address is mapped to which
GPU address and then use the standard ptrace interfaces.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I don't quite agree here -- at all. "Which CPU address is mapped to
which GPU address" makes no sense when the GPU address space and CPU
address space is completely controlled by the userspace driver/application.</pre>
</blockquote>
<br>
Yeah, that's the reason why you should ask the userspace
driver/application for the necessary information instead of going
through the kernel to debug things.<br>
<br>
<blockquote type="cite" cite="mid:173382321353.8959.8314520413901294535@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">Please try to consider things outside of the ROCm architecture.</pre>
</blockquote>
<br>
Well, I consider a good part of the ROCm architecture rather broken,
exactly because we haven't pushed back hard enough on bad ideas.<br>
<br>
<blockquote type="cite" cite="mid:173382321353.8959.8314520413901294535@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">Something like a register scratch region or EU instructions should not
even be mapped into the CPU address space, as the CPU has no business
accessing it during normal operation. And the backing of such a region
will vary per context/LRC for the same virtual address per EU thread.
You seem to be suggesting that we rewrite even our userspace driver to
behave the same way as the ROCm driver does, just so that we could
implement debug memory accesses via ptrace() to the CPU address space.</pre>
</blockquote>
<br>
Oh, well, certainly not. That ROCm has a 1:1 mapping between CPU
and GPU is one thing I've pushed back on massively, and it has now
proven to be problematic.<br>
<br>
<blockquote type="cite" cite="mid:173382321353.8959.8314520413901294535@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">That seems a bit of a radical suggestion, especially given the drawbacks
pointed out in your suggested design.
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">The whole interface re-invents a lot of functionality which is already
there
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I'm not really sure I would call adding a single interface for memory
reading and writing "re-inventing a lot of functionality".
All the functionality behind this interface will be needed by GPU core
dumping anyway, just like for the other patch series.</pre>
</blockquote>
<br>
As far as I can see, exactly that is an absolute no-go. Device core
dumping should *never ever* touch memory imported via userptrs.<br>
<br>
That's what process core dumping is good for.<br>
<br>
<blockquote type="cite" cite="mid:173382321353.8959.8314520413901294535@jlahtine-mobl.ger.corp.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">just because you don't like the idea to attach to the debugged
application in userspace.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
A few points have been brought up as drawbacks of GPU debugging
through ptrace(); to recap the ones relevant for this discussion:
- You can only really support GDB stop-all mode, or at least have to
stop all the CPU threads while you control the GPU threads, to
avoid interference. Elaborated on this more in the other threads.
- Controlling the GPU threads will always interfere with CPU threads.
It doesn't seem feasible to single-step an EU thread while CPU threads
continue to run freely?</pre>
</blockquote>
<br>
I would say no.<br>
<br>
<blockquote type="cite" cite="mid:173382321353.8959.8314520413901294535@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">- You are very much restricted by the CPU VA ~ GPU VA alignment
requirement, which does not hold for OpenGL or Vulkan, etc. Seems
like one of the reasons why ROCm debugging is not easily extendable
outside of compute?</pre>
</blockquote>
<br>
Well, as long as you can't take the debugged threads off the
hardware, you can pretty much forget any OpenGL or Vulkan debugging
with this interface, since it violates the dma_fence restrictions in
the kernel.<br>
<br>
<blockquote type="cite" cite="mid:173382321353.8959.8314520413901294535@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">- You have to expose extra memory to the CPU process just for GPU
debugger access, and keep track of the GPU VA for each mapping. This
makes the GPU more prone to OOB writes from the CPU, which is exactly
what not mapping the memory to the CPU was meant to protect the GPU
from in the first place.
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">As far as I can see this whole idea is extremely questionable. This
looks like re-inventing the wheel in a different color.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I see it as reinventing a round wheel to replace an octagonal one.
Could you elaborate, with facts, much more on your position as to why
the ROCm debugger design is an absolute must for others to adopt?</pre>
</blockquote>
<br>
Well, I'm trying to prevent some of the mistakes we made with the
ROCm design.<br>
<br>
And trying to re-invent well-proven kernel interfaces is one of the
big mistakes made in the ROCm design.<br>
<br>
If you really want to expose an interface to userspace which walks
the process page tables, installs an MMU notifier, kmaps the
resulting page, and then memcpys to/from it, then you absolutely
*must* run that by people like Christoph Hellwig, Andrew, and even
Linus.<br>
<br>
I'm pretty sure that those guys will note that a device driver
should absolutely not mess with such stuff.<br>
<br>
Regards,<br>
Christian.<br>
<br>
<blockquote type="cite" cite="mid:173382321353.8959.8314520413901294535@jlahtine-mobl.ger.corp.intel.com">
<pre class="moz-quote-pre" wrap="">
Otherwise it just looks like you are trying to prevent others from
implementing a more flexible debugging interface through vague comments about
"questionable design" without going into details. Not listing many concrete
benefits, nor addressing the very concretely expressed drawbacks of your
suggested design, makes it seem like a very biased, non-technical discussion.
So while review interest and any comments are very much appreciated, please
also work on providing a bit more reasoning and facts instead of just
claiming things. That will help make the discussion much more fruitful.
Regards, Joonas
</pre>
</blockquote>
<br>
</body>
</html>