<html> <head> <base href="https://bugs.freedesktop.org/"> </head> <body><table border="1" cellspacing="0" cellpadding="8"> <tr> <th>Bug ID</th> <td><a class="bz_bug_link bz_status_NEW " title="NEW - Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy)." href="https://bugs.freedesktop.org/show_bug.cgi?id=107670">107670</a> </td> </tr> <tr> <th>Summary</th> <td>Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy). </td> </tr> <tr> <th>Product</th> <td>Mesa </td> </tr> <tr> <th>Version</th> <td>unspecified </td> </tr> <tr> <th>Hardware</th> <td>x86 (IA32) </td> </tr> <tr> <th>OS</th> <td>All </td> </tr> <tr> <th>Status</th> <td>NEW </td> </tr> <tr> <th>Severity</th> <td>normal </td> </tr> <tr> <th>Priority</th> <td>medium </td> </tr> <tr> <th>Component</th> <td>Other </td> </tr> <tr> <th>Assignee</th> <td>mesa-dev@lists.freedesktop.org </td> </tr> <tr> <th>Reporter</th> <td>iive@yahoo.com </td> </tr> <tr> <th>QA Contact</th> <td>mesa-dev@lists.freedesktop.org </td> </tr></table> <p> <div> <pre>I've traced the massive slowdown to the memcpy() in "mesa/src/gallium/auxiliary/util/u_upload_mgr.c::u_upload_data()", which seems to be used to move data from host memory into video card memory. The slowdown can be observed when a non-SIMD version of the glibc-2.27 function is used (like the one that comes with 32-bit Slackware-current). The system mesa3d package does not exhibit the same slowdown, but it seems to be linked to glibc-2.5. I suspect that the slowdown is caused by a memcpy() implementation that copies data backwards, starting from the end and moving to the beginning. This is likely treated as a non-sequential data transfer over the PCI bus (it probably sends the full 32-bit address for every 32 bits of data). Using the SSE2 memcpy seems to avoid this problem, but I have no idea whether that is because it copies more data at once or because it copies forward.
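For illustration only, here is a minimal sketch of what "copying forward" means above, assuming unsigned int is 32 bits wide (true on x86). The name copy_forward_u32() is hypothetical; it is not an existing Mesa, glibc or kernel function. It only pins down the copy direction and says nothing about actual PCI throughput.

```c
/* Hypothetical helper, not part of Mesa, glibc or the kernel.
 * Copies nwords 32-bit words strictly from the lowest address to the
 * highest. Assumes unsigned int is 32 bits wide and that both pointers
 * are suitably aligned. */
void copy_forward_u32(volatile unsigned int *dst,
                      const unsigned int *src,
                      unsigned long nwords)
{
    /* The volatile qualifier on dst forces one store per word, in
     * program order, so the compiler cannot turn this loop back into
     * a memcpy() call. */
    while (nwords != 0) {
        *dst++ = *src++;
        nwords--;
    }
}
```

Such a helper could only stand in for the memcpy() call when the size is a multiple of 4 bytes and both buffers are 4-byte aligned; that restriction is an assumption of the sketch, not something u_upload_data() guarantees.
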
In my benchmarks, `perf top` showed that the problematic memcpy() consumes 25% of CPU time. In one particular game benchmark, I was getting 50 fps instead of 70 fps. Just replacing that memcpy() with memmove() fixed the issue for me, without having to recompile and replace glibc. However, I do not consider it a reliable fix, as nothing guarantees that memmove() would do the right thing. I think the correct solution would be to create a new function, memcpy_to_pci(), with assembly implementation(s) specifically crafted to maximize PCI/PCIe throughput. The kernel has memcpy_toio()/memcpy_fromio(), but they don't seem to be asm-optimized. I've also seen MPlayer's MMX-optimized mem2agpcpy() in aclib_template.c.</pre> </div> </p> </body> </html>