[Pixman] [PATCH 0/5] ARMv6: New fast path implementations that utilise prefetch

Ben Avison bavison at riscosopen.org
Mon Jan 7 09:09:11 PST 2013


Hello again - I'll reply to Siarhei's various replies in one email, for
simplicity.

On Thu, 27 Dec 2012, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
> Can we have some more descriptive commit message for this and the
> other patches? Preferably benchmark results should be also here for
> the newly added or improved code.

You'll have to forgive my naïveté about distributing code via mailing
lists - I've never done it before, and hadn't even heard of git
format-patch and send-email. I was merely attempting to copy earlier
examples. In my case, I intended my posting to form 5 parts of a single
patch, divided up by source file - I now see that's not what you were
expecting.

Since I'm not the only person to have tripped up over the required format
in the last couple of weeks, perhaps some guidelines on preferred
submission formats, the expected length and contents of commit messages,
pointers to step-by-step instructions and so on wouldn't go amiss - maybe
at pixman.org?

I'm still a bit confused about how multi-part posts work - some but not all
seem to have a 0th part containing no actual diffs. Maybe I don't have to
worry about it because it sounds like the code I'm immediately concerned
with posting will just be one enormous email, but it still nags at me.

I may have to delay posting an updated patch for a little while if the
commit message needs updated or additional benchmarking results. Rather
than posting a series of patches that differ mainly or wholly in the
commit message, I think it makes sense to clear up a few points first;
please see below.

> Are you sure you have really set this "disable_l2cache_writealloc=1"
> option? Because it was initially introduced with a spelling error:
>     https://github.com/raspberrypi/firmware/commit/43b688a2e0a3afd6
> and corrected a bit later:
>     https://github.com/raspberrypi/firmware/commit/be61355f9e7e28a5

On this point, I had noticed the spelling error - but I hadn't realised
that it had been corrected later. The figures I posted previously relate to
a Raspberry Pi in its default configuration; if you would prefer an
alternate set which examines the effect when the cache is configured the
other way, obviously I'll need to re-run all my benchmarks. Maybe it would
make sense to have all 4 sets of results (both old and new code, cache
configured both ways), although trying to represent the results succinctly
becomes even harder then.

Since my code goes to such lengths to ensure it uses 16-byte writes, I'd
be surprised if the setting has all that much effect on my timings (except
in the tiny rectangle tests). Of course, by comparison, the "before" cases
might be expected to improve. However, in some simple simulations of blit
and fill inner loops that I tried during development, the time penalty of
a cache miss on reading was such a dominant effect - and one that isn't
affected by changing the write-allocate setting at all - that I'm still
fairly confident my routines will work out as a net benefit.

>> Some of the things I intend to tackle in the new year are:
>> * no thought has been put into 24bpp formats yet
>
> You can just provide optimized fetch/store iterators for it. I don't
> think anybody seriously uses this format.

I'm afraid I'm not sure what you mean by fetch/store iterators. Do you mean
routines to convert back and forth between r8g8b8 and x8r8g8b8 (with all
the actual compositing operations being performed on the x8r8g8b8 copy)?
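
If that's the idea, then the C sketch below is roughly what I have in
mind - the function names and the exact byte order are only illustrative,
not pixman's real iterator interface:

#include <stdint.h>

/* Illustrative fetch: expand one scanline of packed r8g8b8 into x8r8g8b8
 * so the generic 32bpp compositing code can operate on it.  The byte
 * order and the choice of 0xff for the unused byte are assumptions made
 * for the sake of the example. */
static void
fetch_scanline_r8g8b8 (const uint8_t *src, uint32_t *buf, int width)
{
    int i;

    for (i = 0; i < width; i++)
    {
        uint32_t r = src[i * 3 + 0];
        uint32_t g = src[i * 3 + 1];
        uint32_t b = src[i * 3 + 2];

        buf[i] = 0xff000000u | (r << 16) | (g << 8) | b;
    }
}

/* Illustrative store/write-back: pack the composited x8r8g8b8 scanline
 * back down to r8g8b8 in the destination. */
static void
store_scanline_r8g8b8 (uint8_t *dst, const uint32_t *buf, int width)
{
    int i;

    for (i = 0; i < width; i++)
    {
        dst[i * 3 + 0] = (buf[i] >> 16) & 0xff;
        dst[i * 3 + 1] = (buf[i] >> 8) & 0xff;
        dst[i * 3 + 2] = buf[i] & 0xff;
    }
}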

>> * the number of fast path operations is small compared to other
>>    implementations; I'm targeting eventual parity with the number of
>>    operations in the NEON case
>
> I'm not sure if it's a really good idea. Nowadays pixman got better
> fetch/write back iterators, which avoid doing obviously redundant
> operations and can, for example, perform compositing directly in the
> destination buffer. They also do not try to fetch and convert data
> from the destination buffer to a8r8g8b8 format in the temporary buffer
> in the case if this data is not needed for the compositing operation.

This seems to suggest the opposite. Perhaps you are advocating inserting
inline assembler into the C code? I'm not sure that would work well,
because my assembler optimisations rely heavily on scheduling preloads and
on unrolling loops to minimise the number of preload instructions
(preloading the same cache line twice carries a speed penalty).
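
To illustrate what I mean about the preload scheduling, here is a loose C
analogy of the assembler inner loop, using __builtin_prefetch in place of
explicit PLD instructions (the prefetch distance of 2 cache lines is just
an example figure, not a measured optimum):

#include <stdint.h>

/* The loop is unrolled to cover one 32-byte cache line per iteration, so
 * only one preload is issued per line (issuing two preloads for the same
 * line wastes cycles), and the preload targets data a fixed number of
 * lines ahead of the current reads. */
#define PREFETCH_LINES_AHEAD  2   /* example tuning value */

static void
copy_cache_lines (uint32_t *dst, const uint32_t *src, int cache_lines)
{
    int i;

    for (i = 0; i < cache_lines; i++)
    {
        __builtin_prefetch (src + PREFETCH_LINES_AHEAD * 8);

        dst[0] = src[0]; dst[1] = src[1]; dst[2] = src[2]; dst[3] = src[3];
        dst[4] = src[4]; dst[5] = src[5]; dst[6] = src[6]; dst[7] = src[7];

        src += 8;
        dst += 8;
    }
}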

> One problem with lowlevel-blt-bench is that it can't be used for
> benchmarking OVER compositing operation (because some shortcuts are
> possible, depending on the pixel data in the case if it is completely
> opaque or transparent).
[...]
> Yes, the current traces from http://cgit.freedesktop.org/cairo-traces
> can't be effectively used on really slow embedded systems such as ARM11
> and MIPS32. It could take a day just to run the benchmark. That's why I
> tried to take these traces and trim them to make the execution time
> reasonable on Raspberry Pi and on the other slow systems:
>     http://cgit.freedesktop.org/~siamashka/trimmed-cairo-traces/

I had come across the cairo traces and how slow they are! My over
routines do still have shortcuts, but I couldn't think of a practical way
of benchmarking them in a sensible amount of time - I'll give your trimmed
versions a go, thanks.
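
For reference, the shortcuts amount to the per-pixel logic in the C sketch
below (a scalar, simplified view - not the actual assembler) - which is
exactly why the pixel data a benchmark feeds in determines which path gets
measured:

#include <stdint.h>

/* Scalar sketch of the OVER shortcuts: a fully transparent source pixel
 * leaves the destination alone, a fully opaque one is a plain copy, and
 * only the remaining pixels need the full blend.  Which of these paths
 * dominates depends entirely on the pixel data being composited. */
static uint32_t
over_pixel (uint32_t src, uint32_t dst)
{
    uint32_t alpha = src >> 24;
    uint32_t ialpha, rb, ag;

    if (alpha == 0)
        return dst;             /* transparent source: nothing to do */
    if (alpha == 0xff)
        return src;             /* opaque source: plain copy */

    /* General case: dst = src + dst * (255 - alpha) / 255 per channel,
     * done two channels at a time with the usual divide-by-255 trick. */
    ialpha = 255 - alpha;
    rb = (dst & 0x00ff00ffu) * ialpha + 0x00800080u;
    rb = ((rb + ((rb >> 8) & 0x00ff00ffu)) >> 8) & 0x00ff00ffu;
    ag = ((dst >> 8) & 0x00ff00ffu) * ialpha + 0x00800080u;
    ag = (ag + ((ag >> 8) & 0x00ff00ffu)) & 0xff00ff00u;

    /* A saturating add (like UQADD8) would guard against badly
     * premultiplied input; a plain add suffices for valid data. */
    return src + (rb | ag);
}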

>> +.macro over_8888_8888_1pixel src, dst, offset, next
>> +        /* src = destination component multiplier */
>> +        rsb     WK&src, WK&src, #255
>> +        /* Split even/odd bytes of dst into SCRATCH/dst */
>> +        uxtb16  SCRATCH, WK&dst
>> +        uxtb16  WK&dst, WK&dst, ror #8
>> +        /* Multiply through, adding 0.5 to the upper byte of result for rounding */
>> +        mla     SCRATCH, SCRATCH, WK&src, MASK
>> +        mla     WK&dst, WK&dst, WK&src, MASK
>> +        /* Where we would have had a stall between the result of the first MLA and the shifter input,
>> +         * reload the complete source pixel */
>> +        ldr     WK&src, [SRC, #offset]
>> +        /* Multiply by 257/256 to approximate 256/255 */
>> +        uxtab16 SCRATCH, SCRATCH, SCRATCH, ror #8
>> +        /* In this stall, start processing the next pixel */
>> + .if offset < -4
>> +        mov     WK&next, WK&next, lsr #24
>> + .endif
>> +        uxtab16 WK&dst, WK&dst, WK&dst, ror #8
>> +        /* Recombine even/odd bytes of multiplied destination */
>> +        mov     SCRATCH, SCRATCH, ror #8
>> +        sel     WK&dst, SCRATCH, WK&dst
>> +        /* Saturated add of source to multiplied destination */
>> +        uqadd8  WK&dst, WK&dst, WK&src
>> +.endm
>
> Looks like this over_8888_8888_1pixel macro uses one instruction more
> than the current code. Is it intended?

There's one more instruction in this part of the code, yes, but not at
the cost of speed, due to filling in stall cycles. On the other hand,
because the macro that calls this actually processes more than one pixel
per loop, the outer loads and stores (not shown in the part you clipped)
can be done using LDM/STM (twice as many registers per cycle). The cost of
this is that there are fewer registers available either to hold constants
like 255 and 0xFF00FF00, or to keep both the complete pixel contents and
its alpha component in separate registers. I've dealt with that by using
more instructions, but fitting them into the stall slots so that they
don't come at additional cost.

If you look closely, I've saved one instruction by replacing the AND and
final UXTAB16 with a SEL (which does save one cycle), but then gained two
by adding the LDR and MOV (and both of those fit into stalls, so it's a
net improvement of 1 cycle even before you consider the LDM/STM). It's
also worth noting that the LDR will always be an L1 cache hit, because
we've only just done an LDM of the same location a short while earlier in
order to perform the shortcut test. For bonus points, the late result from
the UQADD8 only affects the final pixel of the group, since for all other
pixels there are plenty of cycles before we come to store it.
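
Incidentally, for anyone following the arithmetic: the MLA-plus-UXTAB16
sequence is the usual exact rounded-divide-by-255 trick. A scalar C
version (one channel at a time instead of two, with names of my own
choosing) looks like this, together with an exhaustive check:

#include <assert.h>
#include <stdint.h>

/* Scalar version of the per-channel arithmetic in the macro: to compute
 * round(c * m / 255), add 0.5 (the 0x80 held in MASK), then multiply by
 * 257/256 - i.e. add the value to itself shifted right by 8 - and take
 * the top byte of the 16-bit result.  This is exact for all 8-bit c, m. */
static uint8_t
mul_un8 (uint8_t c, uint8_t m)
{
    uint32_t t = c * m + 0x80;

    return (uint8_t) ((t + (t >> 8)) >> 8);
}

int
main (void)
{
    unsigned c, m;

    for (c = 0; c < 256; c++)
        for (m = 0; m < 256; m++)
            assert (mul_un8 (c, m) == (c * m + 127) / 255);

    return 0;
}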

Ben

