<html>
<head>
<base href="https://bugs.freedesktop.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy)."
href="https://bugs.freedesktop.org/show_bug.cgi?id=107670">107670</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).
</td>
</tr>
<tr>
<th>Product</th>
<td>Mesa
</td>
</tr>
<tr>
<th>Version</th>
<td>unspecified
</td>
</tr>
<tr>
<th>Hardware</th>
<td>x86 (IA32)
</td>
</tr>
<tr>
<th>OS</th>
<td>All
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>medium
</td>
</tr>
<tr>
<th>Component</th>
<td>Other
</td>
</tr>
<tr>
<th>Assignee</th>
<td>mesa-dev@lists.freedesktop.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>iive@yahoo.com
</td>
</tr>
<tr>
<th>QA Contact</th>
<td>mesa-dev@lists.freedesktop.org
</td>
</tr></table>
<p>
<div>
<pre>I've traced the massive slowdown to the memcpy() call in
"mesa/src/gallium/auxiliary/util/u_upload_mgr.c::u_upload_data()", which seems to
be used to move data from host memory into video card memory.
The slowdown can be observed when a non-SIMD version of the glibc-2.27 memcpy()
is used (like the one that comes with 32-bit Slackware-current). The system
mesa3d package does not exhibit the same slowdown, but it seems to be linked
against glibc-2.5.
I suspect that the slowdown is caused by a memcpy() implementation that copies
data backwards, starting from the end and moving to the beginning. This is
likely treated as a non-sequential data transfer over the PCI bus (it probably
sends the full 32-bit address for every 32 bits of data).
Using the SSE2 memcpy() seems to avoid this problem, but I have no idea whether
that is because it copies more data at once or because it copies forward.
In my benchmarks, `perf top` showed that the problematic memcpy() consumed 25%
of CPU time. In one particular game benchmark, I was getting 50 fps instead of
70 fps.
Just replacing that memcpy() with memmove() fixed the issue for me, without
having to recompile and replace glibc.
However, I do not consider it a reliable fix, as nothing guarantees that
memmove() would do the right thing.
I think the correct solution would be to create a new function,
memcpy_to_pci(), with assembly implementation(s) that are specifically
crafted to maximize PCI/PCIe throughput.
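
To make the copy-direction idea concrete, here is a minimal C sketch of a copy
that always walks from the lowest address to the highest. The name
copy_forward() is hypothetical and is not an existing Mesa, glibc or kernel
function. It only illustrates the ordering of the stores and is not tuned for
PCI/PCIe throughput the way an assembly memcpy_to_pci() would be.

```c
/* Hypothetical helper, not part of Mesa or glibc: copies n bytes from src to
 * dst strictly front to back, so destination addresses only ever ascend. */
static void copy_forward(volatile unsigned char *dst,
                         const unsigned char *src,
                         unsigned long n)
{
    unsigned long i;

    /* The volatile destination is meant to keep the compiler from replacing
     * this loop with a memcpy() call or reordering the stores. */
    for (i = 0; i != n; i++)
        dst[i] = src[i];
}
```

Something like this could be swapped in for the memcpy() in u_upload_data()
as a quick experiment to check whether copy direction alone explains the
slowdown.
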
The kernel has memcpy_toio()/memcpy_fromio(), but they don't seem to be
asm-optimized.
I've seen an MMX-optimized mem2agpcpy() in MPlayer's aclib_template.c.</pre>
</div>
</p>
</body>
</html>