[Pixman] Pixman Digest, Vol 98, Issue 5
刁先举
xianjudiao at gmail.com
Sat Sep 29 01:29:55 UTC 2018
Hello, I have modified the patch according to your suggestions. When you
have time, please review it. If you have any questions, feel free to
raise them with me. Thank you very much!
刁先举 <xianjudiao at gmail.com> wrote on Tue, Sep 25, 2018, 9:47 AM:
> I have seen your suggestions; thank you.
> 1) I have now split the optimized code into two patches, and this patch
> adds new functions to pixman-mmx.c.
> 2) For loongson-mmintrin.h, I have removed some redundant functions. That
> was my fault; I am very sorry!
> 3) For "-march=loongson3a", I asked my colleagues who work on the
> Loongson platform compiler. They said that there are no loongson* options
> yet, but they have added some new options, which have been submitted
> to the community but not yet merged into the master branch, so we can
> wait.
> 4) For "$CFLAGS" = "-g -mabi=n64", I don't know the best way to handle it.
> Can you give me some advice? Thank you very much!
>
> <pixman-request at lists.freedesktop.org> wrote on Thu, Sep 20, 2018, 2:56 AM:
>
>>
>> Today's Topics:
>>
>>    1. Re: [PATCH] mmx: compile on MIPS for Loongson-3A MMI optimizations (Matt Turner)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Wed, 19 Sep 2018 11:56:00 -0700
>> From: Matt Turner <mattst88 at gmail.com>
>> To: xianjudiao at gmail.com
>> Cc: pixman at lists.freedesktop.org
>> Subject: Re: [Pixman] [PATCH] mmx: compile on MIPS for Loongson-3A MMI optimizations
>> Message-ID: <CAEdQ38F2kr6Fo7yLkSNkO_aTM=Ycw5Xx9F6az0L+DHQS1fOKBw at mail.gmail.com>
>> Content-Type: text/plain; charset="UTF-8"
>>
>> On Tue, Sep 18, 2018 at 2:34 AM <xianjudiao at gmail.com> wrote:
>> >
>> > From: Xianju Diao <xianjudiao at gmail.com>
>> >
>> > make check:
>> > When I enable USE_OPENMP, the 'glyph-test' and 'cover-test' tests fail
>> > on Loongson-3A3000. Neither test passes even without the optimized
>> > code, so this may be a multi-core synchronization bug in the CPU; I
>> > will continue to debug it. For now, using an OpenMP critical section,
>> > 'glyph-test' and 'cover-test' pass.
>> >
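As an aside on the OpenMP workaround mentioned above: the general pattern
being described is serializing a shared update inside a parallel loop. The
following is an illustration only, not pixman's actual test code; the
'accumulate_hash' function and 'hash_row' callback are hypothetical
stand-ins.

    #include <stdint.h>

    /* Generic sketch: a data race on '*acc' is ruled out by funneling
     * the shared update through an OpenMP critical section, while the
     * expensive per-row work stays parallel. Compile with -fopenmp. */
    void
    accumulate_hash (uint32_t *acc, const uint32_t *rows, int n,
                     uint32_t (*hash_row) (uint32_t))
    {
        int i;
    #pragma omp parallel for
        for (i = 0; i < n; i++)
        {
            uint32_t h = hash_row (rows[i]);  /* runs in parallel */
    #pragma omp critical
            *acc ^= h;                        /* serialized across threads */
        }
    }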
>> > benchmark:
>> > Running the cairo-perf-trace benchmarks on Loongson-3A (before -> after):
>> >
>> >                              image                  image16
>> > gvim                         5.425 ->   5.069       5.531 ->   5.236
>> > poppler-reseau               2.149 ->   2.130       2.152 ->   2.139
>> > swfdec-giant-steps-full     18.672 ->   8.215      33.167 ->  18.280
>> > swfdec-giant-steps           7.014 ->   2.455      12.480 ->   5.982
>> > xfce4-terminal-al           13.695 ->   5.241      15.703 ->   5.859
>> > gnome-system-monitor        12.783 ->   7.058      12.780 ->   7.104
>> > grads-heat-map               0.482 ->   0.486       0.516 ->   0.514
>> > firefox-talos-svg          141.138 -> 134.621     152.495 -> 159.069
>> > firefox-talos-gfx           23.119 ->  14.437      24.870 ->  15.161
>> > firefox-world-map           32.018 ->  27.139      33.817 ->  28.085
>> > firefox-periodic-table      12.305 ->  12.443      12.876 ->  12.913
>> > evolution                    7.071 ->   3.564       8.550 ->   3.784
>> > firefox-planet-gnome        77.926 ->  67.526      81.554 ->  65.840
>> > ocitysmap                    4.934 ->   1.702       4.937 ->   1.701
>> > ---
>>
>> Thanks for the patch. I will review it when I have time (I'm preparing
>> for a trip at the moment).
>>
>> I have a Loongson3 system that I have found to be unstable. I assume
>> it is due to the hardware bugs that must be worked around in gcc and
>> binutils. I have patched both of them with the patches I found in
>> https://github.com/loongson-community/binutils-gdb etc, but I still
>> have instability. I would appreciate it very much if you could offer
>> some suggestions or help in improving the stability of my system.
>>
>> Looks like there are a couple of different things happening in this
>> patch. We should try to split them up. One patch could be making the
>> assembly memcpy implementation usable on mips64. A separate patch
>> would add new functions to pixman-mmx.c.
>>
>> A few quick comments inline.
>>
>> > configure.ac | 7 +-
>> > pixman/Makefile.am | 4 +-
>> > pixman/loongson-mmintrin.h | 46 ++
>> > pixman/pixman-combine32.h | 6 +
>> > pixman/pixman-mips-dspr2-asm.h | 2 +-
>> > pixman/pixman-mips-memcpy-asm.S | 324 +++++-------
>> > pixman/pixman-mmx.c | 1088 ++++++++++++++++++++++++++++++++++++++-
>> > pixman/pixman-private.h | 32 +-
>> > pixman/pixman-solid-fill.c | 49 +-
>> > pixman/pixman-utils.c | 65 ++-
>> > test/Makefile.am | 2 +-
>> > test/utils.c | 8 +
>>
>> This diff stat doesn't correspond to this patch.
>>
>> > 12 files changed, 1418 insertions(+), 215 deletions(-)
>> >
>> > diff --git a/configure.ac b/configure.ac
>> > index e833e45..3e3dde5 100644
>> > --- a/configure.ac
>> > +++ b/configure.ac
>> > @@ -154,9 +154,9 @@ AC_CHECK_DECL([__amd64], [AMD64_ABI="yes"], [AMD64_ABI="no"])
>> > # has set CFLAGS.
>> > if test $SUNCC = yes && \
>> > test "x$test_CFLAGS" = "x" && \
>> > - test "$CFLAGS" = "-g"
>> > + test "$CFLAGS" = "-g -mabi=n64"
>> > then
>> > - CFLAGS="-O -g"
>> > + CFLAGS="-O -g -mabi=n64"
>>
>> This isn't acceptable.
>>
>> > fi
>> >
>> > #
>> > @@ -183,6 +183,7 @@ AC_SUBST(LT_VERSION_INFO)
>> > # Check for dependencies
>> >
>> > PIXMAN_CHECK_CFLAG([-Wall])
>> > +PIXMAN_CHECK_CFLAG([-mabi=n64])
>> > PIXMAN_CHECK_CFLAG([-Wdeclaration-after-statement])
>> > PIXMAN_CHECK_CFLAG([-Wno-unused-local-typedefs])
>> > PIXMAN_CHECK_CFLAG([-fno-strict-aliasing])
>> > @@ -273,7 +274,7 @@ dnl ===========================================================================
>> > dnl Check for Loongson Multimedia Instructions
>> >
>> > if test "x$LS_CFLAGS" = "x" ; then
>> > - LS_CFLAGS="-march=loongson2f"
>> > + LS_CFLAGS="-march=loongson3a"
>>
>> Also not acceptable. I see that recent gcc and binutils have gotten
>> new options for enabling MMI separately from -march=loongson*. Maybe
>> we could use those if available.
>>
>> I'm not sure there is currently a good solution. Let me think about it.
>>
>> > fi
>> >
>> > have_loongson_mmi=no
>> > diff --git a/pixman/Makefile.am b/pixman/Makefile.am
>> > index 581b6f6..e3a080c 100644
>> > --- a/pixman/Makefile.am
>> > +++ b/pixman/Makefile.am
>> > @@ -122,7 +122,7 @@ libpixman_mips_dspr2_la_SOURCES = \
>> > pixman-mips-dspr2.h \
>> > pixman-mips-dspr2-asm.S \
>> > pixman-mips-dspr2-asm.h \
>> > - pixman-mips-memcpy-asm.S
>> > + #pixman-mips-memcpy-asm.S
>>
>> Can't do this.
>>
>> > libpixman_1_la_LIBADD += libpixman-mips-dspr2.la
>> >
>> > ASM_CFLAGS_mips_dspr2=
>> > @@ -131,7 +131,7 @@ endif
>> > # loongson code
>> > if USE_LOONGSON_MMI
>> > noinst_LTLIBRARIES += libpixman-loongson-mmi.la
>> > -libpixman_loongson_mmi_la_SOURCES = pixman-mmx.c loongson-mmintrin.h
>> > +libpixman_loongson_mmi_la_SOURCES = pixman-mmx.c loongson-mmintrin.h pixman-mips-memcpy-asm.S
>> > libpixman_loongson_mmi_la_CFLAGS = $(LS_CFLAGS)
>> > libpixman_1_la_LDFLAGS += $(LS_LDFLAGS)
>> > libpixman_1_la_LIBADD += libpixman-loongson-mmi.la
>> > diff --git a/pixman/loongson-mmintrin.h b/pixman/loongson-mmintrin.h
>> > index 086c6e0..f049463 100644
>> > --- a/pixman/loongson-mmintrin.h
>> > +++ b/pixman/loongson-mmintrin.h
>> > @@ -89,6 +89,17 @@ _mm_adds_pu8 (__m64 __m1, __m64 __m2)
>> > }
>> >
>> > extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>> > +_mm_andn_si64 (__m64 __m1, __m64 __m2)
>>
>> Doesn't seem to be used.
>>
>> > +{
>> > + __m64 ret;
>> > + asm("pandn %0, %1, %2\n\t"
>> > + : "=f" (ret)
>> > + : "f" (__m1), "f"(__m2)
>> > + );
>> > + return ret;
>> > +}
>> > +
>> > +extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>> > _mm_and_si64 (__m64 __m1, __m64 __m2)
>> > {
>> > __m64 ret;
>> > @@ -100,6 +111,17 @@ _mm_and_si64 (__m64 __m1, __m64 __m2)
>> > }
>> >
>> > extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>> > +_mm_cmpeq_pi16 (__m64 __m1, __m64 __m2)
>>
>> Doesn't seem to be used.
>>
>> > +{
>> > + __m64 ret;
>>
>> Whitespace mistake.
>>
>> > + asm("pcmpeqh %0, %1, %2\n\t"
>> > + : "=f" (ret)
>> > + : "f" (__m1), "f" (__m2)
>> > + );
>> > + return ret;
>> > +}
>> > +
>> > +extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>> > _mm_cmpeq_pi32 (__m64 __m1, __m64 __m2)
>> > {
>> > __m64 ret;
>> > @@ -110,6 +132,30 @@ _mm_cmpeq_pi32 (__m64 __m1, __m64 __m2)
>> > return ret;
>> > }
>> >
>> > +extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>> > +loongson_fand (__m64 __m1, __m64 __m2)
>>
>> Doesn't seem to be used.
>>
>> > +{
>> > + __m64 ret;
>> > + asm("fand %0, %1, %2\n\t"
>> > + : "=f" (ret)
>> > + : "f" (__m1), "f" (__m2)
>> > + );
>> > + return ret;
>> > +}
>> > +
>> > +
>> > +extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>> > +_mm_cmpgt_pi16 (__m64 __m1, __m64 __m2)
>> > +{
>> > + __m64 ret;
>> > + asm("pcmpgth %0, %1, %2\n\t"
>> > + : "=f" (ret)
>> > + : "f" (__m1), "f" (__m2)
>> > + );
>> > + return ret;
>> > +}
>> > +
>> > +
>> > extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>> > _mm_empty (void)
>> > {
>> > diff --git a/pixman/pixman-combine32.h b/pixman/pixman-combine32.h
>> > index cdd56a6..27f62d9 100644
>> > --- a/pixman/pixman-combine32.h
>> > +++ b/pixman/pixman-combine32.h
>> > @@ -14,6 +14,12 @@
>> > #define RB_ONE_HALF 0x800080
>> > #define RB_MASK_PLUS_ONE 0x10000100
>> >
>> > +#define RGB_MASK 0xffffff
>> > +#define RGB_DMASK 0xffffffffffffULL
>> > +#define R_DMASK 0x0000ffff00000000ULL
>> > +#define G_DMASK 0x00000000ffff0000ULL
>> > +#define B_DMASK 0x000000000000ffffULL
>> > +
>> > #define ALPHA_8(x) ((x) >> A_SHIFT)
>> > #define RED_8(x) (((x) >> R_SHIFT) & MASK)
>> > #define GREEN_8(x) (((x) >> G_SHIFT) & MASK)
>> > diff --git a/pixman/pixman-mips-dspr2-asm.h b/pixman/pixman-mips-dspr2-asm.h
>> > index e238566..63d7d96 100644
>> > --- a/pixman/pixman-mips-dspr2-asm.h
>> > +++ b/pixman/pixman-mips-dspr2-asm.h
>> > @@ -77,7 +77,7 @@
>> > .ent symbol, 0; \
>> > symbol: .frame sp, 0, ra; \
>> > .set push; \
>> > - .set arch=mips32r2; \
>> > + .set arch=mips64r2; \
>>
>> Can't do this.
>>
>> > .set noreorder; \
>> > .set noat;
>> >
>> > diff --git a/pixman/pixman-mips-memcpy-asm.S b/pixman/pixman-mips-memcpy-asm.S
>> > index 9ad6da5..a140191 100644
>> > --- a/pixman/pixman-mips-memcpy-asm.S
>> > +++ b/pixman/pixman-mips-memcpy-asm.S
>> > @@ -54,19 +54,20 @@ LEAF_MIPS32R2(pixman_mips_fast_memcpy)
>> >
>> > /* Test if the src and dst are word-aligned, or can be made
>> word-aligned */
>> > xor t8, a1, a0
>> > - andi t8, t8, 0x3 /* t8 is a0/a1
>> word-displacement */
>> > + andi t8, t8, 0x7 /* t8 is a0/a1
>> word-displacement */
>> >
>> > bne t8, zero, $unaligned
>> > negu a3, a0
>> >
>> > - andi a3, a3, 0x3 /* we need to copy a3 bytes to make
>> a0/a1 aligned */
>> > + andi a3, a3, 0x7 /* we need to copy a3 bytes to make
>> a0/a1 aligned */
>> > beq a3, zero, $chk16w /* when a3=0 then the dst (a0)
>> is word-aligned */
>> > subu a2, a2, a3 /* now a2 is the remining bytes count */
>> >
>> > - LWHI t8, 0(a1)
>> > - addu a1, a1, a3
>> > - SWHI t8, 0(a0)
>> > - addu a0, a0, a3
>> > + ld t8, 0(a1)
>> > + daddu a1, a1, a3
>> > + sdl t8, 7(a0)
>> > + sdr t8, 0(a0)
>> > + daddu a0, a0, a3
>> >
>> > /* Now the dst/src are mutually word-aligned with word-aligned
>> addresses */
>> > $chk16w: andi t8, a2, 0x3f /* any whole 64-byte chunks? */
>> > @@ -76,9 +77,9 @@ $chk16w: andi t8, a2, 0x3f /* any whole
>> 64-byte chunks? */
>> > /* There will be at most 1 32-byte
>> chunk after it */
>> > subu a3, a2, t8 /* subtract from a2 the reminder */
>> > /* Here a3 counts bytes in 16w chunks
>> */
>> > - addu a3, a0, a3 /* Now a3 is the final dst after
>> 64-byte chunks */
>> > + daddu a3, a0, a3 /* Now a3 is the final dst after
>> 64-byte chunks */
>> >
>> > - addu t0, a0, a2 /* t0 is the "past the end" address */
>> > + daddu t0, a0, a2 /* t0 is the "past the end" address */
>> >
>> > /*
>> > * When in the loop we exercise "pref 30, x(a0)", the a0+x should not
>> be past
>> > @@ -89,119 +90,98 @@ $chk16w: andi t8, a2, 0x3f /* any whole
>> 64-byte chunks? */
>> > */
>> > subu t9, t0, 160 /* t9 is the "last safe pref 30,
>> 128(a0)" address */
>> >
>> > - pref 0, 0(a1) /* bring the first line of src,
>> addr 0 */
>> > - pref 0, 32(a1) /* bring the second line of src, addr
>> 32 */
>> > - pref 0, 64(a1) /* bring the third line of src, addr 64
>> */
>> > - pref 30, 32(a0) /* safe, as we have at least 64 bytes
>> ahead */
>> > + lw $0, 0(a1) /* bring the first line of src, addr 0
>> */
>> > + lw $0, 32(a1) /* bring the second line of src, addr
>> 32 */
>> > + lw $0, 64(a1) /* bring the third line of src, addr 64
>> */
>> > + lw $0, 32(a0) /* safe, as we have at least 64 bytes
>> ahead */
>> > /* In case the a0 > t9 don't use "pref 30" at all */
>> > sgtu v1, a0, t9
>> > bgtz v1, $loop16w /* skip "pref 30, 64(a0)" for too short
>> arrays */
>> > nop
>> > /* otherwise, start with using pref30 */
>> > - pref 30, 64(a0)
>> > + lw $0, 64(a0)
>> > $loop16w:
>> > - pref 0, 96(a1)
>> > - lw t0, 0(a1)
>> > + lw $0, 96(a1)
>> > + ld t0, 0(a1)
>> > bgtz v1, $skip_pref30_96 /* skip "pref 30, 96(a0)" */
>> > - lw t1, 4(a1)
>> > - pref 30, 96(a0) /* continue setting up the dest, addr 96 */
>> > + lw $0, 96(a0) /* continue setting up the dest, addr 96 */
>> > $skip_pref30_96:
>> > - lw t2, 8(a1)
>> > - lw t3, 12(a1)
>> > - lw t4, 16(a1)
>> > - lw t5, 20(a1)
>> > - lw t6, 24(a1)
>> > - lw t7, 28(a1)
>> > - pref 0, 128(a1) /* bring the next lines of src, addr 128
>> */
>> > -
>> > - sw t0, 0(a0)
>> > - sw t1, 4(a0)
>> > - sw t2, 8(a0)
>> > - sw t3, 12(a0)
>> > - sw t4, 16(a0)
>> > - sw t5, 20(a0)
>> > - sw t6, 24(a0)
>> > - sw t7, 28(a0)
>> > -
>> > - lw t0, 32(a1)
>> > + ld t2, 8(a1)
>> > + ld t4, 16(a1)
>> > + ld t6, 24(a1)
>> > + lw $0, 128(a1) /* bring the next lines of src, addr 128
>> */
>> > + lw $0, 0x0(a0)
>> > +
>> > + sd t0, 0(a0)
>> > + sd t2, 8(a0)
>> > + sd t4, 16(a0)
>> > + sd t6, 24(a0)
>> > +
>> > + ld t0, 32(a1)
>> > bgtz v1, $skip_pref30_128 /* skip "pref 30, 128(a0)" */
>> > - lw t1, 36(a1)
>> > - pref 30, 128(a0) /* continue setting up the dest, addr 128
>> */
>> > + lw $0, 128(a0) /* continue setting up the dest, addr 128 */
>> > $skip_pref30_128:
>> > - lw t2, 40(a1)
>> > - lw t3, 44(a1)
>> > - lw t4, 48(a1)
>> > - lw t5, 52(a1)
>> > - lw t6, 56(a1)
>> > - lw t7, 60(a1)
>> > - pref 0, 160(a1) /* bring the next lines of src, addr 160
>> */
>> > -
>> > - sw t0, 32(a0)
>> > - sw t1, 36(a0)
>> > - sw t2, 40(a0)
>> > - sw t3, 44(a0)
>> > - sw t4, 48(a0)
>> > - sw t5, 52(a0)
>> > - sw t6, 56(a0)
>> > - sw t7, 60(a0)
>> > -
>> > - addiu a0, a0, 64 /* adding 64 to dest */
>> > + ld t2, 40(a1)
>> > + ld t4, 48(a1)
>> > + ld t6, 56(a1)
>> > + lw $0, 160(a1) /* bring the next lines of src, addr 160
>> */
>> > + lw $0, 0x32(a0)
>> > +
>> > + sd t0, 32(a0)
>> > + sd t2, 40(a0)
>> > + sd t4, 48(a0)
>> > + sd t6, 56(a0)
>> > +
>> > + daddiu a0, a0, 64 /* adding 64 to dest */
>> > sgtu v1, a0, t9
>> > bne a0, a3, $loop16w
>> > - addiu a1, a1, 64 /* adding 64 to src */
>> > + daddiu a1, a1, 64 /* adding 64 to src */
>> > move a2, t8
>> >
>> > /* Here we have src and dest word-aligned but less than 64-bytes to go
>> */
>> >
>> > $chk8w:
>> > - pref 0, 0x0(a1)
>> > + lw $0, 0x0(a1)
>> > andi t8, a2, 0x1f /* is there a 32-byte chunk? */
>> > /* the t8 is the reminder count past
>> 32-bytes */
>> > beq a2, t8, $chk1w /* when a2=t8, no 32-byte chunk */
>> > nop
>> >
>> > - lw t0, 0(a1)
>> > - lw t1, 4(a1)
>> > - lw t2, 8(a1)
>> > - lw t3, 12(a1)
>> > - lw t4, 16(a1)
>> > - lw t5, 20(a1)
>> > - lw t6, 24(a1)
>> > - lw t7, 28(a1)
>> > - addiu a1, a1, 32
>> > -
>> > - sw t0, 0(a0)
>> > - sw t1, 4(a0)
>> > - sw t2, 8(a0)
>> > - sw t3, 12(a0)
>> > - sw t4, 16(a0)
>> > - sw t5, 20(a0)
>> > - sw t6, 24(a0)
>> > - sw t7, 28(a0)
>> > - addiu a0, a0, 32
>> > + ld t0, 0(a1)
>> > + ld t2, 8(a1)
>> > + ld t4, 16(a1)
>> > + ld t6, 24(a1)
>> > + lw $0, 0x0(a0)
>> > + daddiu a1, a1, 32
>> > +
>> > + sd t0, 0(a0)
>> > + sd t2, 8(a0)
>> > + sd t4, 16(a0)
>> > + sd t6, 24(a0)
>> > + daddiu a0, a0, 32
>> >
>> > $chk1w:
>> > andi a2, t8, 0x3 /* now a2 is the reminder past 1w
>> chunks */
>> > beq a2, t8, $last8
>> > subu a3, t8, a2 /* a3 is count of bytes in 1w chunks */
>> > - addu a3, a0, a3 /* now a3 is the dst address past the
>> 1w chunks */
>> > + daddu a3, a0, a3 /* now a3 is the dst address past the
>> 1w chunks */
>> >
>> > /* copying in words (4-byte chunks) */
>> > $wordCopy_loop:
>> > lw t3, 0(a1) /* the first t3 may be equal t0 ...
>> optimize? */
>> > - addiu a1, a1, 4
>> > - addiu a0, a0, 4
>> > + daddiu a1, a1, 4
>> > + daddiu a0, a0, 4
>> > bne a0, a3, $wordCopy_loop
>> > sw t3, -4(a0)
>> >
>> > /* For the last (<8) bytes */
>> > $last8:
>> > blez a2, leave
>> > - addu a3, a0, a2 /* a3 is the last dst address */
>> > + daddu a3, a0, a2 /* a3 is the last dst address */
>> > $last8loop:
>> > lb v1, 0(a1)
>> > - addiu a1, a1, 1
>> > - addiu a0, a0, 1
>> > + daddiu a1, a1, 1
>> > + daddiu a0, a0, 1
>> > bne a0, a3, $last8loop
>> > sb v1, -1(a0)
>> >
>> > @@ -214,15 +194,16 @@ leave: j ra
>> >
>> > $unaligned:
>> > /* got here with a3="negu a0" */
>> > - andi a3, a3, 0x3 /* test if the a0 is word aligned */
>> > + andi a3, a3, 0x7 /* test if the a0 is word aligned */
>> > beqz a3, $ua_chk16w
>> > subu a2, a2, a3 /* bytes left after initial a3 bytes */
>> >
>> > - LWHI v1, 0(a1)
>> > - LWLO v1, 3(a1)
>> > - addu a1, a1, a3 /* a3 may be here 1, 2 or 3 */
>> > - SWHI v1, 0(a0)
>> > - addu a0, a0, a3 /* below the dst will be word aligned
>> (NOTE1) */
>> > + ldl v1, 7(a1)
>> > + ldr v1, 0(a1)
>> > + daddu a1, a1, a3 /* a3 may be here 1, 2 or 3 */
>> > + sdl v1, 7(a0)
>> > + sdr v1, 0(a0)
>> > + daddu a0, a0, a3 /* below the dst will be word aligned
>> (NOTE1) */
>> >
>> > $ua_chk16w: andi t8, a2, 0x3f /* any whole 64-byte chunks? */
>> > /* t8 is the byte count after 64-byte
>> chunks */
>> > @@ -230,149 +211,116 @@ $ua_chk16w: andi t8, a2, 0x3f /* any
>> whole 64-byte chunks? */
>> > /* There will be at most 1 32-byte
>> chunk after it */
>> > subu a3, a2, t8 /* subtract from a2 the reminder */
>> > /* Here a3 counts bytes in 16w chunks
>> */
>> > - addu a3, a0, a3 /* Now a3 is the final dst after
>> 64-byte chunks */
>> > + daddu a3, a0, a3 /* Now a3 is the final dst after
>> 64-byte chunks */
>> >
>> > - addu t0, a0, a2 /* t0 is the "past the end" address */
>> > + daddu t0, a0, a2 /* t0 is the "past the end" address */
>> >
>> > subu t9, t0, 160 /* t9 is the "last safe pref 30,
>> 128(a0)" address */
>> >
>> > - pref 0, 0(a1) /* bring the first line of src,
>> addr 0 */
>> > - pref 0, 32(a1) /* bring the second line of src, addr
>> 32 */
>> > - pref 0, 64(a1) /* bring the third line of src, addr 64
>> */
>> > - pref 30, 32(a0) /* safe, as we have at least 64 bytes
>> ahead */
>> > + lw $0, 0(a1) /* bring the first line of src, addr 0
>> */
>> > + lw $0, 32(a1) /* bring the second line of src, addr
>> 32 */
>> > + lw $0, 64(a1) /* bring the third line of src, addr 64
>> */
>> > + lw $0, 32(a0) /* safe, as we have at least 64 bytes
>> ahead */
>> > /* In case the a0 > t9 don't use "pref 30" at all */
>> > sgtu v1, a0, t9
>> > bgtz v1, $ua_loop16w /* skip "pref 30, 64(a0)" for too short
>> arrays */
>> > nop
>> > /* otherwise, start with using pref30 */
>> > - pref 30, 64(a0)
>> > + lw $0, 64(a0)
>> > $ua_loop16w:
>> > - pref 0, 96(a1)
>> > - LWHI t0, 0(a1)
>> > - LWLO t0, 3(a1)
>> > - LWHI t1, 4(a1)
>> > + lw $0, 96(a1)
>> > + ldl t0, 7(a1)
>> > + ldr t0, 0(a1)
>> > bgtz v1, $ua_skip_pref30_96
>> > - LWLO t1, 7(a1)
>> > - pref 30, 96(a0) /* continue setting up the dest, addr 96 */
>> > + lw $0, 96(a0) /* continue setting up the dest, addr 96 */
>> > $ua_skip_pref30_96:
>> > - LWHI t2, 8(a1)
>> > - LWLO t2, 11(a1)
>> > - LWHI t3, 12(a1)
>> > - LWLO t3, 15(a1)
>> > - LWHI t4, 16(a1)
>> > - LWLO t4, 19(a1)
>> > - LWHI t5, 20(a1)
>> > - LWLO t5, 23(a1)
>> > - LWHI t6, 24(a1)
>> > - LWLO t6, 27(a1)
>> > - LWHI t7, 28(a1)
>> > - LWLO t7, 31(a1)
>> > - pref 0, 128(a1) /* bring the next lines of src, addr 128
>> */
>> > -
>> > - sw t0, 0(a0)
>> > - sw t1, 4(a0)
>> > - sw t2, 8(a0)
>> > - sw t3, 12(a0)
>> > - sw t4, 16(a0)
>> > - sw t5, 20(a0)
>> > - sw t6, 24(a0)
>> > - sw t7, 28(a0)
>> > -
>> > - LWHI t0, 32(a1)
>> > - LWLO t0, 35(a1)
>> > - LWHI t1, 36(a1)
>> > + ldl t2, 15(a1)
>> > + ldr t2, 8(a1)
>> > + ldl t4, 23(a1)
>> > + ldr t4, 16(a1)
>> > + ldl t6, 31(a1)
>> > + ldr t6, 24(a1)
>> > + lw $0, 128(a1) /* bring the next lines of src, addr 128
>> */
>> > + lw $0, 0(a0)
>> > +
>> > + sd t0, 0(a0)
>> > + sd t2, 8(a0)
>> > + sd t4, 16(a0)
>> > + sd t6, 24(a0)
>> > +
>> > + ldl t0, 39(a1)
>> > + ldr t0, 32(a1)
>> > bgtz v1, $ua_skip_pref30_128
>> > - LWLO t1, 39(a1)
>> > - pref 30, 128(a0) /* continue setting up the dest, addr 128
>> */
>> > + lw $0, 128(a0) /* continue setting up the dest, addr 128 */
>> > $ua_skip_pref30_128:
>> > - LWHI t2, 40(a1)
>> > - LWLO t2, 43(a1)
>> > - LWHI t3, 44(a1)
>> > - LWLO t3, 47(a1)
>> > - LWHI t4, 48(a1)
>> > - LWLO t4, 51(a1)
>> > - LWHI t5, 52(a1)
>> > - LWLO t5, 55(a1)
>> > - LWHI t6, 56(a1)
>> > - LWLO t6, 59(a1)
>> > - LWHI t7, 60(a1)
>> > - LWLO t7, 63(a1)
>> > - pref 0, 160(a1) /* bring the next lines of src, addr 160
>> */
>> > -
>> > - sw t0, 32(a0)
>> > - sw t1, 36(a0)
>> > - sw t2, 40(a0)
>> > - sw t3, 44(a0)
>> > - sw t4, 48(a0)
>> > - sw t5, 52(a0)
>> > - sw t6, 56(a0)
>> > - sw t7, 60(a0)
>> > -
>> > - addiu a0, a0, 64 /* adding 64 to dest */
>> > + ldl t2, 47(a1)
>> > + ldr t2, 40(a1)
>> > + ldl t4, 55(a1)
>> > + ldr t4, 48(a1)
>> > + ldl t6, 63(a1)
>> > + ldr t6, 56(a1)
>> > + lw $0, 32(a0)
>> > + lw $0, 160(a1) /* bring the next lines of src, addr 160
>> */
>> > +
>> > + sd t0, 32(a0)
>> > + sd t2, 40(a0)
>> > + sd t4, 48(a0)
>> > + sd t6, 56(a0)
>> > +
>> > + daddiu a0, a0, 64 /* adding 64 to dest */
>> > sgtu v1, a0, t9
>> > bne a0, a3, $ua_loop16w
>> > - addiu a1, a1, 64 /* adding 64 to src */
>> > + daddiu a1, a1, 64 /* adding 64 to src */
>> > move a2, t8
>> >
>> > /* Here we have src and dest word-aligned but less than 64-bytes to go
>> */
>> >
>> > $ua_chk8w:
>> > - pref 0, 0x0(a1)
>> > + lw $0, 0x0(a1)
>> > andi t8, a2, 0x1f /* is there a 32-byte chunk? */
>> > /* the t8 is the reminder count */
>> > beq a2, t8, $ua_chk1w /* when a2=t8, no 32-byte chunk
>> */
>> >
>> > - LWHI t0, 0(a1)
>> > - LWLO t0, 3(a1)
>> > - LWHI t1, 4(a1)
>> > - LWLO t1, 7(a1)
>> > - LWHI t2, 8(a1)
>> > - LWLO t2, 11(a1)
>> > - LWHI t3, 12(a1)
>> > - LWLO t3, 15(a1)
>> > - LWHI t4, 16(a1)
>> > - LWLO t4, 19(a1)
>> > - LWHI t5, 20(a1)
>> > - LWLO t5, 23(a1)
>> > - LWHI t6, 24(a1)
>> > - LWLO t6, 27(a1)
>> > - LWHI t7, 28(a1)
>> > - LWLO t7, 31(a1)
>> > - addiu a1, a1, 32
>> > -
>> > - sw t0, 0(a0)
>> > - sw t1, 4(a0)
>> > - sw t2, 8(a0)
>> > - sw t3, 12(a0)
>> > - sw t4, 16(a0)
>> > - sw t5, 20(a0)
>> > - sw t6, 24(a0)
>> > - sw t7, 28(a0)
>> > - addiu a0, a0, 32
>> > + ldl t0, 7(a1)
>> > + ldr t0, 0(a1)
>> > + ldl t2, 15(a1)
>> > + ldr t2, 8(a1)
>> > + ldl t4, 23(a1)
>> > + ldr t4, 16(a1)
>> > + ldl t6, 31(a1)
>> > + ldr t6, 24(a1)
>> > + lw $0, 0x0(a0)
>> > + daddiu a1, a1, 32
>> > +
>> > + sd t0, 0(a0)
>> > + sd t2, 8(a0)
>> > + sd t4, 16(a0)
>> > + sd t6, 24(a0)
>> > + daddiu a0, a0, 32
>> >
>> > $ua_chk1w:
>> > andi a2, t8, 0x3 /* now a2 is the reminder past 1w
>> chunks */
>> > beq a2, t8, $ua_smallCopy
>> > subu a3, t8, a2 /* a3 is count of bytes in 1w chunks */
>> > - addu a3, a0, a3 /* now a3 is the dst address past the
>> 1w chunks */
>> > + daddu a3, a0, a3 /* now a3 is the dst address past the
>> 1w chunks */
>> >
>> > /* copying in words (4-byte chunks) */
>> > $ua_wordCopy_loop:
>> > LWHI v1, 0(a1)
>> > LWLO v1, 3(a1)
>> > - addiu a1, a1, 4
>> > - addiu a0, a0, 4 /* note: dst=a0 is word aligned
>> here, see NOTE1 */
>> > + daddiu a1, a1, 4
>> > + daddiu a0, a0, 4 /* note: dst=a0 is word aligned
>> here, see NOTE1 */
>> > bne a0, a3, $ua_wordCopy_loop
>> > sw v1, -4(a0)
>> >
>> > /* Now less than 4 bytes (value in a2) left to copy */
>> > $ua_smallCopy:
>> > beqz a2, leave
>> > - addu a3, a0, a2 /* a3 is the last dst address */
>> > + daddu a3, a0, a2 /* a3 is the last dst address */
>> > $ua_smallCopy_loop:
>> > lb v1, 0(a1)
>> > - addiu a1, a1, 1
>> > - addiu a0, a0, 1
>> > + daddiu a1, a1, 1
>> > + daddiu a0, a0, 1
>> > bne a0, a3, $ua_smallCopy_loop
>> > sb v1, -1(a0)
>> >
>> > diff --git a/pixman/pixman-mmx.c b/pixman/pixman-mmx.c
>> > index dec3974..edbf16b 100644
>> > --- a/pixman/pixman-mmx.c
>> > +++ b/pixman/pixman-mmx.c
>> > @@ -59,6 +59,71 @@ _mm_empty (void)
>> > }
>> > #endif
>> >
>> > +#define COMBINE_A_OUT 1
>> > +#define COMBINE_A_IN 2
>> > +#define COMBINE_B_OUT 4
>> > +#define COMBINE_B_IN 8
>> > +
>> > +#define COMBINE_CLEAR 0
>> > +#define COMBINE_A (COMBINE_A_OUT | COMBINE_A_IN)
>> > +#define COMBINE_B (COMBINE_B_OUT | COMBINE_B_IN)
>> > +#define COMBINE_A_OVER (COMBINE_A_OUT | COMBINE_B_OUT | COMBINE_A_IN)
>> > +#define COMBINE_B_OVER (COMBINE_A_OUT | COMBINE_B_OUT | COMBINE_B_IN)
>> > +#define COMBINE_A_ATOP (COMBINE_B_OUT | COMBINE_A_IN)
>> > +#define COMBINE_B_ATOP (COMBINE_A_OUT | COMBINE_B_IN)
>> > +#define COMBINE_XOR (COMBINE_A_OUT | COMBINE_B_OUT)
>> > +
>> > +/* no SIMD instructions for div, so leave it alone
>> > + * portion covered by a but not b
>> > + * min (1, (1-b) / a)
>> > + */
>> > +static uint8_t
>> > +combine_disjoint_out_part (uint8_t a, uint8_t b)
>> > +{
>> > +
>> > + b = ~b;
>> > + if (b >= a)
>> > + return MASK;
>> > + return DIV_UN8 (b, a);
>> > +}
>> > +
>> > +/* portion covered by both a and b
>> > + * max (1-(1-b)/a, 0)
>> > + */
>> > +static uint8_t
>> > +combine_disjoint_in_part (uint8_t a, uint8_t b)
>> > +{
>> > +
>> > + b = ~b;
>> > + if (b >= a)
>> > + return 0;
>> > + return ~DIV_UN8(b, a);
>> > +}
>> > +
>> > +/* portion covered by a but not b
>> > + * max (1 - b/a, 0)
>> > + */
>> > +static uint8_t
>> > +combine_conjoint_out_part (uint8_t a, uint8_t b)
>> > +{
>> > +
>> > + if (b >= a)
>> > + return 0x00;
>> > + return ~DIV_UN8(b, a);
>> > +}
>> > +
>> > +/* portion covered by both a and b
>> > + * min (1, b/a)
>> > + */
>> > +static uint8_t
>> > +combine_conjoint_in_part (uint8_t a, uint8_t b)
>> > +{
>> > +
>> > + if (b >= a)
>> > + return MASK;
>> > + return DIV_UN8 (b, a);
>> > +}
>> > +
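The four helpers above compute per-pixel Porter-Duff blend factors in 8-bit
fixed point. As a standalone illustration (not part of the patch), here is
the first one with pixman's usual MASK and DIV_UN8 definitions from
pixman-combine32.h inlined, plus a tiny driver, assuming those definitions
are unchanged:

    #include <stdint.h>
    #include <stdio.h>

    #define MASK 0xff
    /* rounding fixed-point divide: a / b, scaled so 0xff means 1.0 */
    #define DIV_UN8(a, b) (((uint16_t) (a) * MASK + ((b) / 2)) / (b))

    /* min (1, (1 - b) / a): portion covered by a but not b */
    static uint8_t
    disjoint_out (uint8_t a, uint8_t b)
    {
        b = ~b;               /* 1 - b in fixed point */
        if (b >= a)
            return MASK;      /* ratio >= 1, clamp to 1.0 */
        return DIV_UN8 (b, a);
    }

    int
    main (void)
    {
        /* a = 0.5, b = 0.25: (1 - 0.25) / 0.5 = 1.5, clamps to 1.0 (0xff) */
        printf ("%#x\n", (unsigned) disjoint_out (0x80, 0x40));
        return 0;
    }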
>> > #ifdef USE_X86_MMX
>> > # if (defined(__SUNPRO_C) || defined(_MSC_VER) || defined(_WIN64))
>> > # include <xmmintrin.h>
>> > @@ -78,7 +143,8 @@ _mm_movemask_pi8 (__m64 __A)
>> >
>> > return ret;
>> > }
>> > -
>> > +#define __OPTIMIZE__
>> > +#ifdef __OPTIMIZE__
>> > extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>> > _mm_mulhi_pu16 (__m64 __A, __m64 __B)
>> > {
>> > @@ -88,7 +154,7 @@ _mm_mulhi_pu16 (__m64 __A, __m64 __B)
>> > );
>> > return __A;
>> > }
>> > -
>> > +#else
>> > # define _mm_shuffle_pi16(A, N) \
>> > ({ \
>> > __m64 ret; \
>> > @@ -102,7 +168,7 @@ _mm_mulhi_pu16 (__m64 __A, __m64 __B)
>> > })
>> > # endif
>> > #endif
>> > -
>> > +#endif
>> > #ifndef _MSC_VER
>> > #define _MM_SHUFFLE(fp3,fp2,fp1,fp0) \
>> > (((fp3) << 6) | ((fp2) << 4) | ((fp1) << 2) | (fp0))
>> > @@ -710,6 +776,34 @@ combine (const uint32_t *src, const uint32_t *mask)
>> > return vsrc;
>> > }
>> >
>> > +static force_inline void
>> > +mmx_combine_mask_ca(const uint32_t *src, const uint32_t *mask, __m64 *s64, __m64 *m64)
>> > +{
>> > + __m64 res, tmp;
>> > +
>> > + if(!(*mask))
>> > + {
>> > + *s64 = 0;
>> > + *m64 = 0;
>> > + return;
>> > + }
>> > +
>> > + *s64 = load8888(src);
>> > +
>> > + if (*mask == ~0)
>> > + {
>> > + *m64 = expand_alpha(*s64);
>> > + return;
>> > + }
>> > +
>> > + *m64 = load8888(mask);
>> > +
>> > + res = pix_multiply(*s64, *m64);
>> > + tmp = expand_alpha(*s64);
>> > + *s64 = res;
>> > + *m64 = pix_multiply(*m64, tmp);
>> > +}
>> > +
>> > static force_inline __m64
>> > core_combine_over_u_pixel_mmx (__m64 vsrc, __m64 vdst)
>> > {
>> > @@ -729,6 +823,39 @@ core_combine_over_u_pixel_mmx (__m64 vsrc, __m64 vdst)
>> > }
>> >
>> > static void
>> > +mmx_combine_disjoint_over_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + uint32_t *end = dest + width;
>> > + uint32_t s32;
>> > + uint64_t sa64;
>> > + __m64 s64, d64;
>> > +
>> > + while (dest < end)
>> > + {
>> > + s64 = combine (src, mask);
>> > +
>> > + if (s64)
>> > + {
>> > + store8888(&s32, s64);
>> > + sa64 = combine_disjoint_out_part (*dest >> A_SHIFT, s32 >> A_SHIFT);
>> > + d64 = pix_add (pix_multiply (load8888 (dest), expand_alpha_rev ((*(__m64*)&sa64))), s64);
>> > + store8888 (dest, d64);
>> > + }
>> > +
>> > + ++dest;
>> > + ++src;
>> > + if (mask)
>> > + ++mask;
>> > +
>> > + }
>> > +}
>> > +
>> > +static void
>> > mmx_combine_over_u (pixman_implementation_t *imp,
>> > pixman_op_t op,
>> > uint32_t * dest,
>> > @@ -1062,7 +1189,294 @@ mmx_combine_saturate_u (pixman_implementation_t *imp,
>> > }
>> > _mm_empty ();
>> > }
>> > +/* In functions such as 'combine_conjoint_general_u' there are multiple
>> > + * branches selected by the 'combine' parameter. That value does not change
>> > + * while the function runs, so it is unnecessary to test it for every pixel
>> > + * as the original code does; it can be tested once at function entry to set
>> > + * the corresponding function pointer, which is then called directly.
>> > + */
>> > +#define DEF_FUNC_ZERO_MASK(type, zm, suffix, res) \
>> > + static type inline combine_joint_ ##zm## _ ##suffix (type sa, type da, type io_flag) \
>> > + { \
>> > + return res; \
>> > + }
>> > +
>> > +/* 'conjoint' has the same code structure as 'disjoint'; only the function
>> > + * names differ, so this macro generates the corresponding functions. The
>> > + * parameter order is determined by 'io_flag': '0' selects 'in_part' and
>> > + * '1' selects 'out_part'.
>> > + */
>> > +#define DEF_FUNC_COMBINE_JOINT_U(cd, io) \
>> > + static uint8_t inline combine_ ##cd## joint_ ##io## _part_u (uint8_t sa, uint8_t da, uint8_t io_flag) \
>> > + { \
>> > + uint8_t parm[2]; \
>> > + parm[0] = sa * (io_flag ^ 0x1) + da * (io_flag ^ 0x0); \
>> > + parm[1] = sa * (io_flag ^ 0x0) + da * (io_flag ^ 0x1); \
>> > + return combine_ ##cd## joint_ ##io## _part (parm[0], parm[1]); \
>> > + }
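For readers following the token pasting, this is roughly what
DEF_FUNC_COMBINE_JOINT_U(dis, out) expands to (hand expansion for
illustration, not part of the patch):

    static uint8_t inline
    combine_disjoint_out_part_u (uint8_t sa, uint8_t da, uint8_t io_flag)
    {
        uint8_t parm[2];
        /* io_flag == 0 passes (sa, da) through; io_flag == 1 swaps them,
         * so one body serves both the source and destination factors */
        parm[0] = sa * (io_flag ^ 0x1) + da * (io_flag ^ 0x0);
        parm[1] = sa * (io_flag ^ 0x0) + da * (io_flag ^ 0x1);
        return combine_disjoint_out_part (parm[0], parm[1]);
    }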
>> > +/* Macro defining the arrays of function pointers from which the
>> > + * correct handlers are picked at function entry. */
>> > +#define DEF_COMB_FUNC_ARR(cd, SUFFIX, suffix) \
>> > + COMBINE_JOINT_FUNC_##SUFFIX combine_ ##cd## joint_ ##suffix[4] = { \
>> > + combine_joint_zero_ ##suffix, \
>> > + combine_ ##cd## joint_out_part_ ##suffix, \
>> > + combine_ ##cd## joint_in_part_ ##suffix, \
>> > + combine_joint_mask_ ##suffix \
>> > + };
>> > +
>> > +typedef uint8_t (*COMBINE_JOINT_FUNC_U)(uint8_t a, uint8_t b, uint8_t io_flag);
>> > +
>> > +DEF_FUNC_ZERO_MASK(uint8_t,zero,u, 0x0)
>> > +DEF_FUNC_ZERO_MASK(uint8_t,mask,u, ~0x0)
>> > +
>> > +DEF_FUNC_COMBINE_JOINT_U(dis, in);
>> > +DEF_FUNC_COMBINE_JOINT_U(dis, out);
>> > +DEF_COMB_FUNC_ARR(dis,U,u)
>> > +
>> > +DEF_FUNC_COMBINE_JOINT_U(con, in);
>> > +DEF_FUNC_COMBINE_JOINT_U(con, out);
>> > +DEF_COMB_FUNC_ARR(con, U, u)
>> > +/* Shared worker through which both the 'conjoint' and 'disjoint'
>> > + * combiners are implemented. */
>> > +static void
>> > +mmx_combine_joint_general_u (uint32_t * dest,
>> > + const uint32_t *src,
>> > + const uint32_t *mask,
>> > + int width,
>> > + uint8_t comb,
>> > + COMBINE_JOINT_FUNC_U *cjf)
>> > +{
>> > + COMBINE_JOINT_FUNC_U combine_joint_u[2];
>> > + combine_joint_u[0] = cjf[comb & COMBINE_A]; /* in_part */
>> > + combine_joint_u[1] = cjf[(comb & COMBINE_B)>>2]; /* out_part */
>> > +
>> > + uint32_t *end = dest + width;
>> > + while (dest < end)
>> > + {
>> > + __m64 s64 = combine (src, mask);
>> > + __m64 d64,sa64,da64;
>> > + uint8_t sa, da;
>> > + uint32_t tmp;
>> > + uint64_t Fa, Fb;
>> > +
>> > + /* Because these functions contain division instructions,
>> > + * multimedia instructions are not used to optimize them.
>> > + */
>> > + store8888(&tmp, s64);
>> > + sa = tmp >> A_SHIFT;
>> > + da = *dest >> A_SHIFT;
>> > +
>> > + Fa = combine_joint_u[0](sa, da, 0);
>> > + Fb = combine_joint_u[1](sa, da, 1);
>> > +
>> > + d64 = load8888(dest);
>> > + sa64 = expand_alpha_rev (*(__m64*)&Fa);
>> > + da64 = expand_alpha_rev (*(__m64*)&Fb);
>> > +
>> > + d64 = pix_add_mul (s64, sa64, d64, da64);
>> > +
>> > + store8888 (dest, d64);
>> > +
>> > + ++dest;
>> > + ++src;
>> > + if (mask)
>> > + ++mask;
>> > + }
>> > +}
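To make the table indexing at the top of mmx_combine_joint_general_u
concrete, here is a tiny standalone program (illustration only, reusing the
COMBINE_* values defined earlier in this patch) that prints which slots of
the four-entry tables an operator selects:

    #include <stdio.h>

    #define COMBINE_A_OUT 1
    #define COMBINE_A_IN  2
    #define COMBINE_B_OUT 4
    #define COMBINE_B_IN  8
    #define COMBINE_A      (COMBINE_A_OUT | COMBINE_A_IN)
    #define COMBINE_B      (COMBINE_B_OUT | COMBINE_B_IN)
    #define COMBINE_A_OVER (COMBINE_A_OUT | COMBINE_B_OUT | COMBINE_A_IN)

    int
    main (void)
    {
        unsigned comb = COMBINE_A_OVER;   /* 0x7 */
        /* table layout: 0 = zero, 1 = out_part, 2 = in_part, 3 = mask */
        printf ("Fa slot %u (mask, Fa = 1), Fb slot %u (out_part)\n",
                comb & COMBINE_A, (comb & COMBINE_B) >> 2);
        return 0;
    }

For OVER this selects Fa = 1 and Fb = the out_part function, matching the
generic Porter-Duff formulation.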
>> > +
>> > +
>> > +static void
>> > +mmx_combine_disjoint_general_u (uint32_t * dest,
>> > + const uint32_t *src,
>> > + const uint32_t *mask,
>> > + int width,
>> > + uint8_t comb)
>> > +{
>> > + mmx_combine_joint_general_u (dest, src, mask, width, comb,
>> combine_disjoint_u);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_disjoint_in_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_disjoint_general_u (dest, src, mask, width,
>> COMBINE_A_IN);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_disjoint_in_reverse_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_disjoint_general_u (dest, src, mask, width,
>> COMBINE_B_IN);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_disjoint_out_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_disjoint_general_u (dest, src, mask, width,
>> COMBINE_A_OUT);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_disjoint_out_reverse_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_disjoint_general_u (dest, src, mask, width,
>> COMBINE_B_OUT);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_disjoint_atop_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_disjoint_general_u (dest, src, mask, width,
>> COMBINE_A_ATOP);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_disjoint_atop_reverse_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_disjoint_general_u (dest, src, mask, width,
>> COMBINE_B_ATOP);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_disjoint_xor_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_disjoint_general_u (dest, src, mask, width,
>> COMBINE_XOR);
>> > +}
>> > +
>> > +/* Conjoint */
>> > +static void
>> > +mmx_combine_conjoint_general_u(uint32_t * dest,
>> > + const uint32_t *src,
>> > + const uint32_t *mask,
>> > + int width,
>> > + uint8_t comb)
>> > +{
>> > + mmx_combine_joint_general_u (dest, src, mask, width, comb,
>> combine_conjoint_u);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_conjoint_over_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_conjoint_general_u (dest, src, mask, width,
>> COMBINE_A_OVER);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_conjoint_over_reverse_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_conjoint_general_u (dest, src, mask, width,
>> COMBINE_B_OVER);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_conjoint_in_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_conjoint_general_u (dest, src, mask, width,
>> COMBINE_A_IN);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_conjoint_in_reverse_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_conjoint_general_u (dest, src, mask, width,
>> COMBINE_B_IN);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_conjoint_out_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_conjoint_general_u (dest, src, mask, width,
>> COMBINE_A_OUT);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_conjoint_out_reverse_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_conjoint_general_u (dest, src, mask, width,
>> COMBINE_B_OUT);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_conjoint_atop_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_conjoint_general_u (dest, src, mask, width,
>> COMBINE_A_ATOP);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_conjoint_atop_reverse_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_conjoint_general_u (dest, src, mask, width,
>> COMBINE_B_ATOP);
>> > +}
>> > +
>> > +static void
>> > +mmx_combine_conjoint_xor_u (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + mmx_combine_conjoint_general_u (dest, src, mask, width,
>> COMBINE_XOR);
>> > +}
>> >
>> > +/* Component alpha combiners */
>> > static void
>> > mmx_combine_src_ca (pixman_implementation_t *imp,
>> > pixman_op_t op,
>> > @@ -1089,6 +1503,410 @@ mmx_combine_src_ca (pixman_implementation_t *imp,
>> > }
>> >
>> > static void
>> > +mmx_combine_saturate_ca (pixman_implementation_t *imp,
>> > + pixman_op_t op,
>> > + uint32_t * dest,
>> > + const uint32_t * src,
>> > + const uint32_t * mask,
>> > + int width)
>> > +{
>> > + uint32_t *end = dest + width;
>> > + while (dest < end)
>> > + {
>> > + uint16_t sa, sr, sg, sb;
>> > + uint32_t sa32, m32;
>> > + __m64 m64, s64, d64, sa64, da64, cmpf, res;
>> > +
>> > + mmx_combine_mask_ca (src, mask, &s64, &m64);
>> > +
>> > + d64 = load8888 (dest);
>> > + da64 = expand_alpha (negate(d64));
>> > + cmpf = _mm_cmpgt_pi16 (m64, da64);
>> > + if (cmpf)
>> > + {
>> > + store8888 (&m32, m64);
>> > + sa = (m32 >> (A_SHIFT));
>> > + sr = (m32 >> (R_SHIFT)) & MASK;
>> > + sg = (m32 >> (G_SHIFT)) & MASK;
>> > + sb = m32 & MASK;
>> > + sa32 = (~(*dest) >> A_SHIFT) & MASK;
>> > +
>> > + sa = (sa) ? sa : 0x1;
>> > + sr = (sr) ? sr : 0x1;
>> > + sg = (sg) ? sg : 0x1;
>> > + sb = (sb) ? sb : 0x1;
>> > +
>> > + sa32 = ((sa32 << G_SHIFT) / sb & MASK) |
>> > + ((((sa32 << G_SHIFT) / sg) & MASK) << G_SHIFT) |
>> > + ((((sa32 << G_SHIFT) / sr) & MASK) << R_SHIFT) |
>> > + ((((sa32 << G_SHIFT) / sa) & MASK) << A_SHIFT);
>> > + sa64 = load8888 (&sa32);
>> > + da64 = MC (4x00ff);
>> > + res = pix_multiply (s64, sa64);
>> > + s64 = _mm_or_si64 (_mm_and_si64 (res, cmpf),
>> _mm_and_si64 (s64, negate (cmpf)));
>> > + res = pix_multiply (d64, da64);
>> > + d64 = _mm_or_si64 (_mm_and_si64 (res, cmpf),
>> _mm_and_si64 (d64, negate (cmpf)));
>> > + }
>> > + res = _mm_adds_pu8 (s64, d64);
>> > + store8888 (dest, res);
>> > +
>> > + ++dest;
>> > + ++src;
>> > + if (mask)
>> > + ++mask;
>> > + }
>> > +}
>> > +
>> > +#define DEF_FUNC_COMBINE_JOINT_CA(cd, io) \
>> > + static uint32_t inline combine_ ##cd## joint_ ##io## _part_ca (uint32_t sa, uint32_t da, uint32_t io_flag) \
>> > + { \
>> > + uint8_t da8 = da >> A_SHIFT; \
>> > + uint32_t m, n, o, p, res; \
>> > + uint8_t i, parm[2][4], shift = 0; \
>> > + for (i = 0; i < 4; i++) \
>> > + { \
>> > + parm[0][i] = (uint8_t)(sa >> shift) * (io_flag ^ 0x1) + da8 * (io_flag ^ 0x0); \
>> > + parm[1][i] = (uint8_t)(sa >> shift) * (io_flag ^ 0x0) + da8 * (io_flag ^ 0x1); \
>> > + shift += G_SHIFT; \
>> > + } \
>> > + m = (uint32_t)combine_ ##cd## joint_ ##io## _part (parm[0][0], parm[1][0]); \
>> > + n = (uint32_t)combine_ ##cd## joint_ ##io## _part (parm[0][1], parm[1][1]) << G_SHIFT; \
>> > + o = (uint32_t)combine_ ##cd## joint_ ##io## _part (parm[0][2], parm[1][2]) << R_SHIFT; \
>> > + p = (uint32_t)combine_ ##cd## joint_ ##io## _part (parm[0][3], parm[1][3]) << A_SHIFT; \
>> > + res = m | n | o | p; \
>> > + return res; \
>> > + }
>> > +
>> > +typedef uint32_t (*COMBINE_JOINT_FUNC_CA)(uint32_t sa, uint32_t da, uint32_t io_flag);
>> > +
>> > +DEF_FUNC_ZERO_MASK(uint32_t, zero, ca, 0x0)
>> > +DEF_FUNC_ZERO_MASK(uint32_t, mask, ca, ~0x0)
>> > +
>> > +DEF_FUNC_COMBINE_JOINT_CA(dis, in);
>> > +DEF_FUNC_COMBINE_JOINT_CA(dis, out);
>> > +DEF_COMB_FUNC_ARR(dis, CA, ca)
>> > +
>> > +DEF_FUNC_COMBINE_JOINT_CA(con, in);
>> > +DEF_FUNC_COMBINE_JOINT_CA(con, out);
>> > +DEF_COMB_FUNC_ARR(con, CA, ca)
>> > +
>> > +static void
>> > +mmx_combine_joint_general_ca (uint32_t * dest,
>> > + const uint32_t *src,
>> > + const uint32_t *mask,
>> > + int width,
>> > + uint8_t comb,
>> > + COMBINE_JOINT_FUNC_CA *cjf)
>> > +{
>> > + COMBINE_JOINT_FUNC_CA combine_joint_ca[2];
>> > + combine_joint_ca[0] = cjf[comb & COMBINE_A];
>> > + combine_joint_ca[1] = cjf[(comb & COMBINE_B)>>2];
>> > +
>> > + uint32_t *end = dest + width;
>> > + while (dest < end)
>> > + {
>> > + __m64 m64, s64, sa64, da64, d64;
>> > +
>
>