[Mesa-dev] [PATCH 28/29] i965: Drop random 32-bit assembly implementation of memcpy().

Roland Scheidegger sroland at vmware.com
Tue Oct 1 06:19:34 PDT 2013

On 01.10.2013 03:57, Roland Mainz wrote:
> On Tue, Oct 1, 2013 at 3:43 AM, Ian Romanick <idr at freedesktop.org> wrote:
>> On 09/30/2013 05:47 PM, Roland Mainz wrote:
>>> On Tue, Oct 1, 2013 at 2:27 AM, Ian Romanick <idr at freedesktop.org> wrote:
>>>> On 09/27/2013 04:46 PM, Kenneth Graunke wrote:
>>>>> This was only used for uploading batchbuffer data, and only on 32-bit
>>>>> systems.  If this is actually useful, we might want to use it more
>>>>> widely.  But more than likely, it isn't.
>>>> This probably is still useful, alas.  The glibc memcpy wants to do an
>>>> Atom-friendly backwards walk of the addresses.
>>> Erm... just curious: Are you sure this is done for Atom ? Originally
>>> such copy-from-highest-to-lowest-address copying is (should be: "was")
>>> done to enable overlapping copies... but at least POSIX mandates that
>>> |memcpy()| is not required to support overlapping copies and users
>>> should use |memmove()| instead in such cases (for example Solaris uses
>>> the POSIX interpretation in this case... and AFAIK Apple OSX even hits
>>> you with an |abort()| if you attempt an overlapping copy with
>>> |memcpy()| (or |strcpy()|) (and AFAIK "valgrind" will complain about
>>> such abuse of |memcpy()|/|strcpy()|/|stpcpy()|, too)).
>> I was pretty sure it was Atom... though looking at the glibc source, the
>> backward memcpy is only used on Core i3, i5, and i7 unless
>> USE_AS_MEMMOVE is defined.  Hmm...
> Grrrr... wrong glibc used on wrong CPU?
> Well... on Solaris that issue was solved via the loop-back filesystem:
> -- snip --
> $ mount | fgrep libc
> /lib/libc.so.1 on /usr/lib/libc/libc_hwcap1.so.1
> read/write/setuid/devices/dev=1690002 on Sat Sep 28 17:08:31 2013
> -- snip --
> (the system starts with a generic libc and then figures out which
> optimised libc to use and then mounts the matching libc as file (!!)
> over /lib/libc.so.1).
> The other solution in this area is to add special CPU type handling to
> the linker and runtime linker... the advantage is that you only need
> one libc.so.1 binary... the disadvantage is that libc.so.1,
> linker and runtime linker must cooperate... and debugging is far more
> difficult than using mount(1)+fgrep(1) ... ;-/
>>>> For some kinds of
>>>> mappings (uncached?), this breaks write combining and ruins performance.
>>> That more or less breaks performance _everywhere_ because automatic
>>> prefetch obtains the next cache line and not the previous one.
>> Except your out-of-order CPU is really smart, and, IIRC, that makes it
>> usually not break.  I think.
> I'm not sure whether the out-of-order CPUs can look that deep into the
> instruction queue or whether their queue is really _that_ deep (not
> even Sun's "ROCK" processor tried that). AFAIK it's more likely that
> some "statistics" bits remember that the loop is going backwards...
> ... but that's all speculation about crazy CPU architecture.

Yes, as far as I know all (?) modern cpus can recognize access patterns
in memory accesses, so they can correctly prefetch even non-adjacent
cache lines (though I don't know whether the old Atom, which is only
partly a modern cpu, would qualify). I'm not sure why backward vs.
forward copying would make a difference, though (I would suspect the
difference is tiny). I don't think going backwards necessarily has to
break write combining either, so that might be specific to some cpus as
well.

