<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Uff, well that's quite a problem you
ran into here.<br>
<br>
The IOMMU might not help here, because if it is the GPU, we would
have created the mapping once, and the page in question is then
never unmapped (IIRC).<br>
<br>
To confirm whether it is really the GPU writing those bytes, I would
add a trace point to amdgpu_ttm_tt_populate() to see which pages the
GPU got assigned.<br>
<br>
If the page with the corruption is not in that list, it is unlikely
(but not impossible) that the GPU is the one doing the corruption.<br>
<br>
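Once such a trace point exists, checking the list can be scripted.
What follows is only a rough sketch: the trace line format
(amdgpu_ttm_tt_populate: pfn=0x...) and the helper names are made up
for illustration, since the trace point does not exist yet, and it
assumes 4 KiB pages. The address from the corruption report would
have to be translated to a physical address first.<br>
<pre wrap="">
```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages

# Hypothetical trace line format; the real one depends on how the
# trace point in amdgpu_ttm_tt_populate() ends up being written.
PFN_RE = re.compile(r"amdgpu_ttm_tt_populate: pfn=(0x[0-9a-fA-F]+)")


def gpu_pfns(trace_lines):
    """Collect every page frame number reported by the trace point."""
    pfns = set()
    for line in trace_lines:
        match = PFN_RE.search(line)
        if match:
            pfns.add(int(match.group(1), 16))
    return pfns


def page_was_given_to_gpu(phys_addr, pfns):
    """True if the page containing phys_addr was ever assigned to the GPU."""
    return phys_addr // PAGE_SIZE in pfns


if __name__ == "__main__":
    # Made-up sample lines, only to show the intended usage.
    trace = [
        "piglit-1234 [003] 100.000001: amdgpu_ttm_tt_populate: pfn=0x1a2b3",
        "piglit-1234 [003] 100.000002: amdgpu_ttm_tt_populate: pfn=0x1a2b4",
    ]
    pfns = gpu_pfns(trace)
    print(page_was_given_to_gpu(0x1A2B3 * PAGE_SIZE + 0x128, pfns))
```
</pre>
A miss here only makes the GPU less likely as the culprit; as said
above, it does not rule it out.<br>
<br>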
Good luck,<br>
Christian.<br>
<br>
On 14.02.2017 at 14:20, Nicolai Hähnle wrote:<br>
</div>
<blockquote
cite="mid:0f5edda6-e162-78a9-d065-6f53254f5fa1@gmail.com"
type="cite">Hi all,
<br>
<br>
on an amd-staging-4.9 kernel with lock debugging and KASAN
enabled, I am seeing a bug where I suspect that the GPU may be
writing into system memory where it shouldn't.
<br>
<br>
I can reproduce errors fairly reliably by running a parallel
piglit run on 8 cores with Tonga.
<br>
<br>
See exhibit1 and exhibit2 for two of the errors that were
reported. As you can see, poison data of a dead object was
overwritten.
<br>
<br>
If that had been done by a use-after-free in kernel code, I would
expect to see a KASAN error about it, but I don't. Furthermore,
the pattern of overwritten values is quite unusual: single bytes
with an 8-byte stride, often all of them the same value. This is
the kind of pattern that could fit GPU writes to an 8-bit texture.
<br>
<br>
See kasan-corrupted for another type of report that I've seen.
This report looks like KASAN's internal data structures were
corrupted, leading to a crash.
<br>
<br>
Needless to say, while I can reproduce those crashes fairly
reliably, they are totally non-deterministic.
<br>
<br>
So the question is how to figure out where the bad memory writes
happen.
<br>
<br>
I noticed that the IOMMU on the system was disabled by the BIOS,
so I enabled it in the hope that it would catch bad GPU
behavior.
<br>
<br>
Well, this leads to lots of IO_PAGE_FAULT messages during the
amdgpu module initialization (see dmesg-iommu). When running
piglit, however, I get the same type of random memory corruption
errors / crashes as before, and no IOMMU errors.
<br>
<br>
Any ideas on (a) what kind of tools could be helpful in tracking
this problem down (if any...), and (b) where in the code the
problem lies?
<br>
<br>
I suspect something's wrong with GART mappings when buffers are
moved, but that's pretty vague...
<br>
<br>
Thanks,
<br>
Nicolai
<br>
<br>
</blockquote>
</body>
</html>