[Pixman] testsuite fails on power7

Thu Aug 29 14:26:44 PDT 2013

On Thu, 29 Aug 2013 16:23:49 -0400
"Lennart Sorensen" <lsorense at csclub.uwaterloo.ca> wrote:

> On Thu, Aug 29, 2013 at 10:56:49PM +0300, Siarhei Siamashka wrote:
> > On Thu, 29 Aug 2013 15:18:57 -0400
> > "Lennart Sorensen" <lsorense at csclub.uwaterloo.ca> wrote:
> > 
> > > I get crashes in the scaling and affinity tests on power7.  The crashes
> > > are always in the vmx code, so building with vmx support disabled makes
> > > the problem go away.
> > > 
> > > The error is not consistent, so my current guess is that multiple threads
> > > are running and depending on timing one thread manages to sometimes
> > > corrupt another and cause it to fail.
> > > 
> > > As far as I can tell, it doesn't fail on power5 or power6 machines,
> > > but given the interesting memory model of the powerpc and requirement
> > > for explicit syncs and barriers to ensure things have really made it to
> > > memory and other CPUs, the power7 has managed to show up bugs in glibc
> > > and gcc already where power5 and power6 and other powerpc systems never
> > > failed before.
> > > 
> > > Any suggestions on how to debug this or where to look?  Any traces or
> > > logs that would be helpful?
> > > 
> > > I am currently using version 0.26.0-4 debian package on Debian 7 (wheezy).
> > > 
> > > Interestingly, if I change the version of libc to 2.17 instead of 2.13
> > > that wheezy is using, then the problem also disappears, but again, this
> > > might just be a timing change causing this, or perhaps there is something
> > > relevant changed in the newer libc, although I haven't spotted anything
> > > suspicious looking when doing a diff so far.
> > 
> > VMX/Altivec is a bit tricky because all the vector load/store
> > operations must be aligned. For the unaligned reads/writes, pixman
> > seems to use the LOAD_VECTORS and STORE_VECTOR macros:
> 
> My understanding was that vec_lda must be aligned but vec_ld does not
> have to be aligned.

I'm not really familiar with the Altivec intrinsics. They might provide
some syntactic sugar (which might also be compiler-specific), but in the
end the intrinsics are converted to Altivec instructions in the
generated code. There are two Altivec manuals (for assembly and
intrinsics) here:

http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPEM.pdf
http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf

> >     http://cgit.freedesktop.org/pixman/tree/pixman/pixman-vmx.c?id=pixman-0.30.2#n151
> > 
> > The STORE_VECTOR macro is particularly interesting because it performs
> > two stores. We can have a look at the typical combiner function, such
> > as "vmx_combine_over_u_no_mask":
> > 
> >     http://cgit.freedesktop.org/pixman/tree/pixman/pixman-vmx.c?id=pixman-0.30.2#n187
> > 
> > If the destination buffer is unaligned and the width is an exact
> > multiple of 4 pixels, I believe that we may have some writes
> > crossing the boundaries of the destination buffer.
> > 
> > I suspect that it just reads the data outside the destination buffer,
> > modifies the parts which really belong to the destination image and
> > writes everything back (so that the chunk of memory outside the
> > destination buffer is restored by the STORE_VECTOR macro to the value
> > that it had at the time of LOAD_VECTORS invocation). Without heavy
> > multithreading this kinda works just fine. But with many concurrent
> > threads, the chunk of data beyond the destination buffer may be
> > possibly actively used by some other thread, creating a race condition.
> > 
> > That was just a guess based on the quick look at the pixman vmx code.
> > You can possibly try to experiment with overriding malloc by something
> > that allocates memory blocks with 16 bytes granularity (for both the
> > starting address and size). This would make sure that each 16 bytes
> > aligned memory chunk is never shared by multiple threads. If the crashes
> > disappear, then that's probably it. And libc 2.17 might be
> > enforcing something like this.
> 
> Well running under valgrind shows that sometimes the LOAD_VECTORS and
> STORE_VECTOR do read and write outside the malloc area.

Reads outside the malloc area are fine. If we need to read only a single
last pixel of the image, then reading the whole 16 byte chunk it
belongs to is also fine. It is not going to cause any segfaults (the
needed bytes and the extra bytes from this 16 byte chunk all belong to
the same memory page). And merely reading obviously can't corrupt
memory. I think glibc uses similar tricks on x86 as well, which is
why valgrind ships with a list of suppressions for false positives.
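
To make the same-page argument concrete, here is a small sketch in C.
The helper name and the 4096-byte page size are my assumptions for
illustration (the real page size on power7 may well be larger); the
argument only needs the page size to be a multiple of 16:

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE        16u    /* size of one Altivec vector */
#define ASSUMED_PAGE_SIZE 4096u  /* assumption; any multiple of 16 works */

/* Hypothetical helper: returns 1 if the 16-byte aligned chunk that
 * contains 'addr' lies entirely inside the page that contains 'addr'. */
static int chunk_stays_in_page(uintptr_t addr)
{
    uintptr_t first = addr & ~(uintptr_t)(CHUNK_SIZE - 1); /* chunk start */
    uintptr_t last  = first + CHUNK_SIZE - 1;              /* chunk end */
    uintptr_t page  = addr / ASSUMED_PAGE_SIZE;

    return first / ASSUMED_PAGE_SIZE == page &&
           last / ASSUMED_PAGE_SIZE == page;
}
```

Because a page boundary is always also a 16-byte boundary, an aligned
16-byte chunk can never straddle two pages, so the extra bytes are
always readable whenever the needed bytes are.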

But any writes outside of the malloc area are really bad. Even if they
write back the same value that was read from this memory location just
a few instructions ago.
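
Here is a single-threaded sketch in C of why writing back the "same"
value is still harmful. It only simulates the interleaving; the function
name and the buffer layout (12 bytes ours, 4 bytes belonging to a
neighbour) are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simulation of a lost update. Bytes 0..11 of the chunk
 * belong to our image, bytes 12..15 belong to somebody else. Returns
 * the neighbour's byte as it ends up after our whole-chunk store. */
static uint8_t lost_update_demo(void)
{
    uint8_t chunk[16] = { 0 };
    uint8_t tmp[16];

    memcpy(tmp, chunk, 16);  /* our thread: load the whole 16-byte chunk */
    chunk[12] = 42;          /* other thread: updates its own byte */
    memset(tmp, 0xff, 12);   /* our thread: modify only our 12 bytes */
    memcpy(chunk, tmp, 16);  /* our thread: store the whole chunk back */

    return chunk[12];        /* the other thread's 42 has been lost */
}
```

The store puts back the value that was read earlier, which is no longer
the current value, so the other thread's update silently disappears.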

A possible solution is to add some extra code before the VMX combiner
loops to align the destination to a 16-byte boundary, just like it is
done for SSE2:

    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-sse2.c?id=pixman-0.30.2#n663
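
The general shape of such a loop could look like the sketch below. This
is plain C, not the actual pixman code: combine_pixel and
combine_aligned_sketch are hypothetical names, the per-pixel operation
is just a placeholder, and the inner 4-pixel loop stands in for what
would be a single aligned vector load/store on the destination:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pixel operation standing in for a real combiner. */
static uint32_t combine_pixel(uint32_t src, uint32_t dst)
{
    return src | dst; /* placeholder arithmetic */
}

/* Hypothetical combiner skeleton: scalar head, aligned body, scalar tail. */
static void combine_aligned_sketch(uint32_t *dest, const uint32_t *src,
                                   int width)
{
    /* Head: one pixel at a time until dest is 16-byte aligned. */
    while (width > 0 && ((uintptr_t)dest & 15) != 0) {
        *dest = combine_pixel(*src, *dest);
        dest++; src++; width--;
    }

    /* Body: dest is aligned here, so every group of 4 pixels is exactly
     * one aligned 16-byte chunk that lies fully inside the destination. */
    while (width >= 4) {
        int i;
        for (i = 0; i < 4; i++)
            dest[i] = combine_pixel(src[i], dest[i]);
        dest += 4; src += 4; width -= 4;
    }

    /* Tail: the remaining 0..3 pixels, again one at a time. */
    while (width > 0) {
        *dest = combine_pixel(*src, *dest);
        dest++; src++; width--;
    }
}
```

With this structure no store ever touches bytes outside the destination
scanline. The source may still be unaligned in the body, but that only
involves reads.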

You can try to disable most of the VMX combiners by commenting out
the pointer initialization here:

   http://cgit.freedesktop.org/pixman/tree/pixman/pixman-vmx.c?id=pixman-0.30.2#n1622

For debugging purposes keep just one combiner that still reliably
triggers the problem. Fix the problem there, and then apply the same
fix to the rest of the VMX code.

> I tried just making the malloc get 16 bytes extra, but that did not solve
> the issue. It seems it has to be something more complicated than that.

We may also have trouble with accesses to memory before the malloc
area. The address of the allocated memory block should also be 16-byte
aligned to work around the problem, so just allocating 16 extra bytes
is not enough. You can try using memalign/posix_memalign to test this.
In any case, that's only a test to investigate/confirm the problem. It
might not be worth the time.
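
If you do want to try it, a test-only allocator might look roughly like
this. The names round_up16 and alloc16 are made up, and how it gets
hooked in place of the normal malloc calls is left out:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: round a size up to a multiple of 16 bytes. */
static size_t round_up16(size_t size)
{
    return (size + 15) & ~(size_t)15;
}

/* Hypothetical test-only allocator: both the start address and the
 * size have 16-byte granularity, so no aligned 16-byte chunk is ever
 * shared between two allocations. Release the block with free(). */
static void *alloc16(size_t size)
{
    void *p = NULL;
    size_t rounded;

    if (size > SIZE_MAX - 15)
        return NULL;             /* rounding up would overflow */
    rounded = round_up16(size);
    if (rounded == 0)
        rounded = 16;            /* avoid a zero-sized request */
    if (posix_memalign(&p, 16, rounded) != 0)
        return NULL;
    return p;
}
```

If the crashes disappear with every image buffer allocated this way,
that would point at the out-of-bounds stores in the VMX code.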

> I am not sure if the vec_ld is implemented in the compiler or libc,

The intrinsics are converted to assembly instructions by the compiler.

> and I can't remember if I still used the same gcc version when testing
> with libc 2.17.  I am using gcc 4.6 from Debian wheezy at the moment.
> I am pretty sure I tried with 4.7 as well with no change in behaviour.

I suspect that the only relevant difference between glibc versions
affecting this bug could be the malloc implementation.

Some other possible sources of problems are the OpenMP implementation
and TLS. But if everything works fine with VMX disabled, then they are
probably not at fault here.

-- 
Best regards,
Siarhei Siamashka