[Pixman] [PATCH 2/3] mmx: fix unaligned accesses

Wed Aug 3 07:15:10 PDT 2011

On Mon, Aug 1, 2011 at 6:29 AM, Matt Turner <mattst88 at gmail.com> wrote:
> On Sat, Jul 23, 2011 at 10:28 PM, Siarhei Siamashka
> <siarhei.siamashka at gmail.com> wrote:
>> The 'test1' function does not look good because it uses ARM
>> instructions to read data one byte at a time and combine it. Function
>> 'test2' looks a bit better because it now uses WALIGNR, but this is
>> still not an optimal solution. Ideally, if we need to read N
>> contiguous unaligned 64-bit values, this requires (N + 1) loads via
>> WLDRD instructions and N fixups via WALIGNR, also shift argument for
>> WALIGNR has to be calculated only once.
>
> Do you think I should do separate while loops for unaligned and aligned?
>
> i.e., in pixman_blt_mmx,
>
> if ((unsigned long)s & 7) {
>    while (w >= 64) {
>        /* src is aligned. no walign fix-ups needed */
>    }
> } else {
>    while (w >= 64) {
>        /* src is unaligned. N+1 loads and N fix-ups needed */
>    }
> }

Yes, the implementations effectively need to be separate, just because
unaligned case needs to do N+1 loads and aligned case needs one load
less. Having wrong number of loads can cause bad memory accessed,
potentially leading to crashes. But in order to avoid extra code
duplications, using inline helper function and some macros can be
tried:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* just some debugging output */

void
do_something(uint64_t v)
{
    printf("%016" PRIx64 " ", v);
}

/* WLDRD and WALIGNR instructions emulation */

uint64_t
wldrd (uint64_t *ptr)
{
    return *ptr;
}

uint64_t
walignr (uint64_t t1, uint64_t t2, int align)
{
    return (t2 << (64 - align * 8)) | (t1 >> (align * 8));
}

#define UNALIGNED64_FETCHER_INIT(t1, t2, ptr, is_aligned, align)       \
    /* if not aligned, then we need to load n+1 values (one extra) */  \
    if (!is_aligned)                                                   \
    {                                                                  \
        t1 = wldrd (ptr);                                              \
        ptr++;                                                         \
    }

#define UNALIGNED64_FETCH(v, t1, t2, ptr, is_aligned, align)           \
    t2 = wldrd(ptr);                                                   \
    ptr++;                                                             \
    if (is_aligned)                                                    \
        v = t2;                                                        \
    else                                                               \
        v = walignr (t1, t2, align);

void __attribute__((always_inline))
process_n_int64_helper (uint64_t *ptr, int n, int is_aligned, int align)
{
    uint64_t t1, t2, v;

    /* 'is_aligned' is a compile time constant and can be optimized */
    UNALIGNED64_FETCHER_INIT (t1, t2, ptr, is_aligned, align);

    while (n >= 2)
    {
        UNALIGNED64_FETCH (v, t1, t2, ptr, is_aligned, align);
        do_something (v);
        UNALIGNED64_FETCH (v, t2, t1, ptr, is_aligned, align);
        do_something (v);
        n -= 2;
    }
    if (n > 0)
    {
        UNALIGNED64_FETCH (v, t1, t2, ptr, is_aligned, align);
        do_something (v);
    }
    printf("\n");
}

void
process_n_int64 (void *ptr, int n)
{
    int align = (uintptr_t)ptr & 7;
    printf("align = %d\n", align);
    if (align == 0)
        process_n_int64_helper ((uint64_t *)ptr,
                                n, 1, 0);
    else
        process_n_int64_helper ((uint64_t *)((uint8_t *)ptr - align),
                                n, 0, align);
}

int
main (void)
{
    static char __attribute__((aligned(16))) buf[256];
    int i;
    for (i = 0; i < 256; i++)
        buf[i] = i;

    process_n_int64 (buf + 0, 3);
    process_n_int64 (buf + 1, 4);
    process_n_int64 (buf + 7, 3);
    process_n_int64 (buf + 8, 4);

    return 0;
}

A possible improvement over the pseudocode suggested by Soeren is that
'st = new_st' register move is not needed anymore, but this requires
loop unrolling. Another side effect of loop unrolling is that the
compiler will have a bit more freedom scheduling instructions.

> walign{i,r} have 1-cycle latency and throughput, and back-to-back
> w{ld,st}rd instructions seem to cause a stall. So it seems to me that
> I can put walign instructions between wldrd instructions and not lose
> any performance, even when the loads are aligned.

Do you have an optimization manual with the instruction cycle timings
for this particular Marvell processor from OLPC XO-1.75? Because these
cycle timings look very much exactly like the information for the old
Intel XScale processors and I guess something could have changed since
then. I could not find any precise details, but AnandTech says that
Armada 610 is "an in-order, dual issue core". An interesting question
is whether dual issue is supported for IWMMXT instructions there, or
is it single issue with the same cycle timings as XScale? Or maybe it
is different, but still single issue? I guess there could be many
possibilities :)

BTW, I have just written a blog post about experimenting with
instruction cycle timings for different processors (nothing new, but
just summarizes the information from some old discussions from irc and
the mailing lists):
    http://ssvb.github.com/2011/08/03/discovering-instructions-scheduling-secrets.html

-- 
Best regards,
Siarhei Siamashka