[Mesa-dev] [PATCH 28/29] i965: Drop random 32-bit assembly implementation of memcpy().

Roland Mainz roland.mainz at nrubsig.org
Mon Sep 30 18:57:22 PDT 2013


On Tue, Oct 1, 2013 at 3:43 AM, Ian Romanick <idr at freedesktop.org> wrote:
> On 09/30/2013 05:47 PM, Roland Mainz wrote:
>> On Tue, Oct 1, 2013 at 2:27 AM, Ian Romanick <idr at freedesktop.org> wrote:
>>> On 09/27/2013 04:46 PM, Kenneth Graunke wrote:
>>>> This was only used for uploading batchbuffer data, and only on 32-bit
>>>> systems.  If this is actually useful, we might want to use it more
>>>> widely.  But more than likely, it isn't.
>>>
>>> This probably is still useful, alas.  The glibc memcpy wants to do an
>>> Atom-friendly backwards walk of the addresses.
>>
>> Erm... just curious: Are you sure this is done for Atom ? Originally
>> such copy-from-highest-to-lowest-address copying is (should be: "was")
>> done to enable overlapping copies... but at least POSIX mandates that
>> |memcpy()| is not required to support overlapping copies and users
>> should use |memmove()| instead in such cases (for example Solaris uses
>> the POSIX interpretation in this case... and AFAIK Apple OSX even hits
>> you with an |abort()| if you attempt an overlapping copy with
>> |memcpy()| (or |strcpy()|) (and AFAIK "valgrind" will complain about
>> such abuse of |memcpy()|/|strcpy()|/|stpcpy()|, too)).
>
> I was pretty sure it was Atom... though looking at the glibc source, the
> backward memcpy is only used on Core i3, i5, and i7 unless
> USE_AS_MEMMOVE is defined.  Hmm...

Grrrr... the wrong glibc used on the wrong CPU?

Well... on Solaris that issue was solved via the loop-back filesystem:
-- snip --
$ mount | fgrep libc
/lib/libc.so.1 on /usr/lib/libc/libc_hwcap1.so.1 read/write/setuid/devices/dev=1690002 on Sat Sep 28 17:08:31 2013
-- snip --
(the system starts with a generic libc, figures out which optimised
libc variant matches the CPU, and then mounts that library as a file
(!!) over /lib/libc.so.1).

The other solution in this area is to add special CPU type handling to
the linker and the runtime linker... the advantage is that you only
need a single libc.so.1 binary... the disadvantage is that libc.so.1,
the linker and the runtime linker all have to cooperate... and
debugging that is far more difficult than mount(1)+fgrep(1) ... ;-/
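
A rough sketch of that second approach, assuming a GNU toolchain with
IFUNC support (GCC's |__attribute__((ifunc(...)))| plus
|__builtin_cpu_init()|/|__builtin_cpu_supports()|; the |my_memcpy*|
names are made up for illustration only):
-- snip --
#include <stddef.h>
#include <string.h>

/* Two stand-in implementations; real code would use tuned assembly. */
static void *my_memcpy_generic(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

static void *my_memcpy_sse2(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);   /* pretend this one is SSE2-tuned */
}

/* The runtime linker calls this resolver once, when the symbol is
 * bound, and from then on uses whichever function it returns. */
static void *(*resolve_my_memcpy(void))(void *, const void *, size_t)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? my_memcpy_sse2
                                          : my_memcpy_generic;
}

void *my_memcpy(void *dst, const void *src, size_t n)
    __attribute__((ifunc("resolve_my_memcpy")));
-- snip --
That is exactly the "linker and runtime linker must cooperate" part:
the dispatch happens at symbol binding time, not in libc's own code.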

>>> For some kinds of
>>> mappings (uncached?), this breaks write combining and ruins performance.
>>
>> That more or less breaks performance _everywhere_ because automatic
>> prefetch obtains the next cache line and not the previous one.
>
> Except your out-of-order CPU is really smart, and, IIRC, that makes it
> usually not break.  I think.

I'm not sure whether out-of-order CPUs can look that deep into the
instruction queue, or whether their queues are really _that_ deep (not
even Sun's "ROCK" processor tried that). AFAIK it's more likely that
some "statistics" bits in the hardware prefetcher remember that the
loop is walking backwards...
... but that's all speculation about crazy CPU architecture.
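
FWIW, to illustrate the write-combining point quoted above: the usual
way to fill a WC mapping is a plain forward walk with non-temporal
stores, roughly like the sketch below (just an illustration using SSE2
intrinsics, not the code this patch removes; it assumes |dst| is
16-byte aligned):
-- snip --
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Forward-walking copy into a write-combining (WC) mapping.  The
 * non-temporal stores fill each WC buffer front to back, so it can be
 * flushed as one full burst instead of many partial writes.  Assumes
 * |dst| is 16-byte aligned; |src| may be unaligned. */
static void wc_copy_forward(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;

    while (n >= 16) {
        _mm_stream_si128((__m128i *) d,
                         _mm_loadu_si128((const __m128i *) s));
        d += 16;
        s += 16;
        n -= 16;
    }
    while (n--)                   /* byte tail */
        *d++ = *s++;

    _mm_sfence();                 /* order the streaming stores */
}
-- snip --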

What about a "configure" probe which checks whether |memcpy()| does or
does not support overlapping copies (see the warning about Apple OSX
in my older posting), and then assumes it's safe to use if it does
_not_ support overlapping copies ...
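
A rough sketch of such a probe as a test program (e.g. run via
autoconf's |AC_RUN_IFELSE|; the exit-code convention, and the idea
that a crash/|abort()| also counts as "overlap not supported", are my
assumptions):
-- snip --
/* Exit 0 if memcpy() copes with an overlapping copy, non-zero (or
 * crash/abort) otherwise. */
#include <string.h>

/* Call through a volatile function pointer so the compiler cannot
 * replace the call with its builtin memcpy. */
static void *(*volatile do_copy)(void *, const void *, size_t) = memcpy;

int main(void)
{
    char buf[64];
    int i;

    for (i = 0; i < 64; i++)
        buf[i] = (char) i;

    /* Overlapping, destination one byte above source: a backward-
     * walking memcpy() gets this "right", a plain forward copy
     * smears buf[0] over the whole range. */
    do_copy(buf + 1, buf, 32);

    for (i = 1; i <= 32; i++)
        if (buf[i] != (char) (i - 1))
            return 1;   /* overlapping copies NOT supported */

    return 0;           /* overlapping copies supported */
}
-- snip --
(When cross-compiling the probe can't be run, so configure would have
to fall back to a conservative default.)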

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)

