[Pixman] [PATCH 00/32] (Mostly) ARMv6 speedups

Sun Oct 5 12:03:42 PDT 2014

Hi,

Comments on the patch series below. In the future, please consider
sending smaller patch sets so that I can review them in smaller chunks
of time. Those are easier to find than multiple consecutive hours ...

Thanks,
Søren

==================================================================
0001:	Typo in preload				Looks good
==================================================================

Looks good

==================================================================
0002:	lowlevel parse				Needs work 
0003:	affine benchmarker			Needs work 
==================================================================

Both of these could benefit from taking the support for
format/operator parsing and printing in check-format.c and moving it
to utils.[ch], perhaps extended to deal with strings like 'none' etc.

Also some coding style issues. (Use a typedef for structs, don't
define a struct on a single line, braces around statements that span
multiple lines -- see CODING_STYLE for more information.

==================================================================
0004:	Temporary buffers			Needs work 
==================================================================

The idea behind aligning the final buffer is that it allows fetchers
to assume that they can do a full 16-byte SIMD write to the buffer. We
may possibly not do that, though.

The change to use 15 instead of 32 is alright I guess. Is this
something that actually shows up in practice?

==================================================================
0005:	Early detection of 1x1			Needs work
==================================================================

I don't think this works as written. What if someone writes a
non-opaque pixel to the 1x1 image after it has been used once? Then
the flags would be wrong.

We could potentially do the check inside pixman_image_composite32(),
though you would have to also check at least for ~FAST_PATH_ACCESSORS.

If you want to go through with this patch, I think we need a test that
reads and writes opaque and non-opaque pixels to 1x1 pixel images, and
verification that the results don't change before/after the patch.

==================================================================
0006:	fast: over_n_8888  			Looks good
==================================================================

Looks good to me

==================================================================
0007:	armv6: over_n_8888			Not reviewed
==================================================================

Not reviewed

==================================================================
0008:	fast: over_n_0565			Looks good
==================================================================

Looks good to me

==================================================================
0009:	armv6: over_n_0565			Not reviewed
==================================================================

Not reviewed

==================================================================
0010:	fast: in_8888_8 - sRGB			Needs work 
0011:	armv6: in_8888_8 - sRGB?		Needs work 
==================================================================

These can presumably also be used for the a8r8g8b8_sRGB format since
all we care about is the alpha channel.

Also, is this actually showing up in the wild? Which applications?

==================================================================
0012:	armv6: over_8888_8_0565			Not reviewed
0013:	armv6: over_8888_n_0565			Not reviewed
0014:	in_n_8888				Not reviewed
0015:	Improved over_8888_8888			Not reviewed
0016:	single scanline				Not reviewed
0017:	arm: Move Bind_COMBINE			Not reviewed
0018:	armv6: ADD combiner			Not reviewed
0019:	armv6: OVER_REVERSE combiner		Not reviewed
0020:	armv6: IN, IN_REVERSE, ...		Not reviewed
0021:	armv6: OVER combiner   			Not reviewed
0022:	armv6: SRC combiner    			Not reviewed
==================================================================

These are all armv6 specific, and I did not review them beyond
checking that there's nothing obviously insane in them.

==================================================================
0023:	fetchers/writeback			Needs work 
==================================================================

Don't you need "FAST_PATH_STANDARD_DST_FLAGS" in the ITER_DEST case?
Also, since you have a writeback function for 565, you might want to
copy the fast_dest_fetch_noop setup from pixman-fast-path.c

==================================================================
0024:	fetcher for a1r5g5b5			Not reviewed
==================================================================

Not reviewed

==================================================================
0025:	src_1555_8888				Needs work 
==================================================================

Can these be used for x8r8g8b8/x8b8g8r8 targets as well? I did not
look at the assembly.

==================================================================
0026:	nearest_scaled_cover_a8r8g8b8		Not reviewed
0027:	nearest_scaled_cover_r5g6b5
0028:	nearest_scaled_cover_x8r8g8b8
0029:	nearest_scaled_cover_a8
==================================================================

I only looked at the table initializations, which look good to me.

==================================================================
0030:	nearest_scaled_cover src_8888_8888
0031:	four more
0032:	nearest_scaled_cover
==================================================================

I can't say I'm a big fan of these patches. They are essentially
workarounds for the longstanding problem that pixman sometimes won't
pick the fastest code for some operations. In this case the general
iter based code will be faster then the C code in pixman-fast-path.c
because the iter code will use assembly fetchers.

As a result you end up with a bunch of partial reimplementations of
general_composite_rect() inside pixman-arm-simd.c.

Maybe we need to admit failure and make general_composite_rect()
available for other implementations to use. Essentially we would
officially provide a way for implementations to say: My iterators are
faster than pixman-fast-path.c for these specific operations, so just
go directly to general_composite_rect().

It's definitely not a pleasant thing to do, but given that nobody is
likely to have the time/skill combination required to do the necessary
redesign of pixman's fast path system, it might still be preferable to
to do this instead of duplicating the code like these patches do.

With a setup like that, we could also fix the same issue for the
bilinear SSSE3 fetchers and possibly other cases.

Regarding 0031: If you had the noop setup mentioned in 0023, this
looks like it could also just be a fallback to general_composite_rect.

==================================================================
0033 fetcher-for-a8r8g8b8-bilinear-interpolatio.patch
0034 fetcher-for-x8r8g8b8-bilinear-interpolatio.patch
0035 fetcher-for-r5g6b5-bilinear-interpolation-.patch
0036 fetcher-for-a8-bilinear-interpolation-scal.patch
==================================================================

I think this one might benefit from having the infrastructure added
first and some more comments on how the code works.

And why does it have to be so much more complicated than the SSSE3
code?

Coding style: Use /* */ instead of //
Coding style: braces on separate lines
Coding style: Comments look like this:

       /* blah blah
        * blah blah
	*/

not like this:

       /* blah blah
        * blah blah */

Also, has this code been tested with BILINEAR_INTERPOLATION_BITS set
to anything other than 7?

==================================================================
0037: More accurate COVER_CLIP_NEAREST|BILINEAR		Needs work
==================================================================

A concern I have here is that code might access pixels outside the
image that have weight 0. Ie., in the bilinear case, some code might
attempt to read the pixel at bits[-1] and then multiply it with 0. But
we can't assume that bits[-1] is readable.

If I remember correctly, the +pixman_fixed_e * 8 stuff was intended to
handle this case.

I think it would be worthwhile to have a test that uses fence_malloc
for the source buffer and the matrix mentioned in the commit. In fact,
the fence_malloc() testing could benefit from being extended in
various ways:

      - having fence pages both before and after the image
      - having fence pages in the 'stride' part of the image