[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

Fri Aug 24 00:15:07 UTC 2018

https://bugs.freedesktop.org/show_bug.cgi?id=107670

            Bug ID: 107670
           Summary: Massive slowdown under specific memcpy implementations
                    (32bit, no-SIMD, backward copy).
           Product: Mesa
           Version: unspecified
          Hardware: x86 (IA32)
                OS: All
            Status: NEW
          Severity: normal
          Priority: medium
         Component: Other
          Assignee: mesa-dev at lists.freedesktop.org
          Reporter: iive at yahoo.com
        QA Contact: mesa-dev at lists.freedesktop.org

I've traced the massive slowdown to the memcpy() in
"mesa/src/gallium/auxiliary/util/u_upload_mgr.c::u_upload_data()" that seems to
be used to move data from the host memory into the video card memory.

The slowdown could be observed if non-SIMD version of the glibc-2.27 function
is used (like the one that comes with the 32 bit Slackware-current). The system
mesa3d package does not exhibit the same slowdown, but it seems to be linked to
glibc-2.5.

I do suspect that the slowdown is caused by memcpy() implementation that copies
data backwards, starting from the end and moving to the beginning. This is
likely treated as non-sequential data transfer over the PCI bus (it probably
sends the full 32 bit address for every 32 bits of data).
Using SSE2 memcpy seems to avoid this problem, but I have no idea if it is
because it copies more data at once or because it copies forward.

In my benchmarks, `perf top` showed that the problematic memcpy() consumes 25%
CPU time. In a particular game benchmark, I was getting 50fps instead of 70fps.

Just replacing that memcpy() with memmove() fixed the issue for me, without
having to recompile and replace glibc.
However I do not consider it reliable fix, as there is nothing guaranteeing
that memmove() would do the right thing.

I think that the correct solution would be to create a new function
memcpy_to_pci() and having assembly implementation(s) that are specifically
crafted to maximize PCI/PCIe throughput.
The kernel has memcpy_toio/fromio(), but they don't seem to be asm optimized.
I've seen MPlayer MMX optimized mem2agpcpy() in aclib_template.c .

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are the QA Contact for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20180824/2c849006/attachment.html>