[Pixman] [PATCH 01/11] mmx: add store function and use it in add_8888_8888

Sat Apr 14 19:02:12 PDT 2012

Wow, it's taken me a while to get back to this.

On Fri, Mar 16, 2012 at 1:06 PM, Søren Sandmann <sandmann at cs.au.dk> wrote:
> Most of this series look good, but a few comments:
>
> - Considering the fact that all the intrinsics are reimplemented in
>  inline assembly, do you actually need -march=loongson?

Yes. gcc won't generate some of the instructions without it.

> - Should the header file loongson-mmintrin.h be added to
>  libpixman_loongson_mmi_la_SOURCES? Otherwise it might not show up in
>  tarballs.

Yes, added.

> - It would be useful to have benchmark results in the commit log to be
>  sure that the MMI support is actually an improvement. Especially
>  cairo-trace runs since they are more realistic workloads than
>  lowlevel-blt.
>
> - It may be possible to do this:
>
>        if (is_equal (_mm_and_si64 (vsrc, MC (full_alpha)), MC (full_alpha)))
>
>  from mmx_combine_over_u() a bit more efficiently with an is_opaque()
>  function that did something like
>
>        _mm_cmpeq_pi8 (tmp, tmp);       /* always 0xffffffff */
>        return _mm_movemask_pi8 (_mm_cmpeq_pi8 (pixel, tmp)) & 0x40
>
>  Ie., check that just the 7th byte is 0xff.
>
> Regarding the optimizations of in_8_8, in_n_8_8, and add_8_8,

Good idea. I've made that change for !USE_LOONGSON_MMI since MMI's
pmovmask instruction doesn't actually write its result into an integer
register... Another example of how bad the instruction set is.

> - in_8_8 seems to be mostly a loss on iwMMX and not much of an
>  improvement on Loongson. I might believe that a more realistic
>  benchmark such as one of the cairo traces would show more of an
>  improvement, though.

Without MMI, the L1/L2/M lowlevel-blt-bench numbers are almost double
what they are with MMI, so this function needs some work anyway.

On top of that, I tried loading *dst earlier before the conditional
(like I accidentally did in the functions below) and it still hurt
performance on ARM, although less than with this patch. So, I guess
I'll drop the in_8_8 patch until I can figure out what's going on.

> - in_n_8_8: I don't understand this one. You replace a multiplication
>  with a branch, but (a) you don't avoid any memory references, which is
>  usually the point of these check-for-zero optimizations, and (b) you
>  only do it in the edge case and not the middle of the scanlines?

Not skipping the memory reference was a mistake, but ...

> - add_8_8: Here too you also only do the optimization in the edge
>  cases.

I can't (at least with lowlevel-blt-bench) measure any performance
improvements by doing the optimization in the middle of scanlines. I
suppose it's because the optimization would only skip a block of 4
pixels when all are all zero. That might be an uncommon case in a
synthetic benchmark like lowlevel-blt-bench. Another thing is that
you've got to wait until the src is loaded before you can know whether
you need to load dst, which potentially doubles the amount of time
you're waiting for memory.

I think because of the extra memory latency it might not be worth
doing the optimization.

On the other hand, as you can see it does help when doing the
optimization on the edge cases.

I'll plan to try to find some cairo traces that exercise the in_*
functions. I doesn't look like they are very important, since we don't
have NEON fast paths for them! :)

With the in_8_8 patch dropped and the other comments addressed, good to commit?