[Pixman] [PATCH 2/2] ARM: Add 'neon_composite_over_n_8888_0565' fast path
siarhei.siamashka at gmail.com
Mon Jun 6 06:06:56 PDT 2011
On Thu, Apr 7, 2011 at 7:15 AM, Soeren Sandmann <sandmann at cs.au.dk> wrote:
> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
>> Of course, people reading the source code need to know about this
>> "convention". And it has its own disadvantages too. If anyone can
>> propose something more maintainable and easier to read, I'm all ears.
>> Maybe changing to the use of a native code generator to compile fast path
>> code at runtime could make it easier. If we do a good job teaching it
>> to know instruction scheduling rules well enough.
> As a tangent, I took a look at LuaJIT's dynamic assembler:
> It's MIT licensed and looks quite interesting. It has a clever idea,
> where it does almost all the assembling at compile time, so that there
> is no need to have a runtime assembler with "emit_movzx()" type
> functions.
> Here is how it works: A mix of C and assembly is preprocessed. The C
> code is emitted directly; the assembly is converted to machine code with
> dummy labels. Then, at runtime this bytecode is interpreted, which emits
> and links the machine code.
> The advantage of this scheme is that the runtime component doesn't need
> to know anything about instruction encodings or addressing modes, so it
> can be really tiny - a few kilobytes or so. It also means you can write
> real assembly instead of calling emit_*() functions.
> However, a downside is that it could be difficult to do good code
> scheduling since it seems it would work best if it can stitch together
> pre-written blocks of assembly, much like the code generator macros do
> for the NEON fast paths.
> Other potential issues are that Lua would become a build-time dependency
> for pixman since the preprocessor is written in Lua, and that it
> currently doesn't support NEON, though presumably he would take patches.
> Anyway, it seems to me to be worth taking a closer look at it to see if
> it could be suitable as the basis of a pixman JIT compiler.
I'm not sure there is a real need for LuaJIT or some other existing
JIT engine. Converting the existing ARM NEON macro assembly to
runtime code generation should actually be pretty straightforward and
not especially challenging (though still rather time consuming). It
involves a few steps:
1. The use of NEON instructions can indeed be translated into
'emit_some_instruction()' calls. The layers of nested macros can also
be easily handled via higher level functions, so that
/* do something */
can be directly translated into an 'emit_blah()' function call. The
correctness of runtime code generation can be verified by comparing
its results with the output of gnu assembler (they should be
identical). An immediate benefit here is that some parameters which
are currently configured at compile time (RESPECT_STRICT_ALIGNMENT and
PREFETCH_TYPE_DEFAULT), can be actually tuned at runtime in order to
be better optimized for the detected CPU variant.
2. Modify the 'emit_some_instruction()' functions in such a way that
they do not immediately generate instructions, but instead pass them
through a basic instruction scheduler and automatic pipeliner which
knows about NEON instruction latencies. Also make it provide some
debugging output
such as the estimated number of cycles the code takes per pixel and
how many pipeline stalls it still contains (so that the worst cases
could be analyzed and improved).
3. Automatic code generation for a wide range of fast paths by just
chaining FETCH -> COMBINE -> STORE blocks. This also needs at least a
basic register allocator in order to smoothly glue the pieces together.
And I specifically don't want any intermediate machine independent
code here, simply because ARM NEON and x86 SSE2, for example, are very
different, and I expect that targeting them both via some portable
pseudocode is going to become either inefficient or way too complex.
So I think that if an x86 JIT backend is also desired, it is better
implemented independently (maybe just sharing some really high level
parts), probably primarily targeting Intel Atom, as it has a very
simple and predictable pipeline (which should be easy to describe for
the scheduler) and is probably the most performance limited relevant
x86 processor.
>> I think it might be interesting for you. I also have the following
>> experimental branch:
>> It collects statistics about what operations do not have optimized
>> fast paths, along with the number of uses of these operations, total
>> number of pixels processed, average number of pixels per operation and
>> average scanline length. The code is currently linux specific and
>> writes results to syslog. These results can be converted into a more
>> human readable form by a script. I'm using it quite successfully and
>> it revealed some of the missing optimizations which would be hard to
>> identify in some other way.
> I think one issue that prevented that from going into pixman proper was
> that there was no good way to get the computed flags down to the general
> code path.
What prevented it from going into pixman git master was that the code
is quite hackish and not clean. And there is little motivation to
clean it up, because there are hardly going to be many users for it
(only those who are already familiar with the pixman code). Other than
that, it works mostly fine and just adds a bit of runtime overhead.
> If so, it might be interesting to combine it with this
> in which the composite arguments are passed in a stack allocated struct
> instead of as function arguments. The computed flags could then be
> stored in that struct too with only minimal overhead.
Is it an attempt to bring the old FbComposeData struct back to life,
now rebranded as pixman_composite_info_t? I'm actually all for this
change if it gets confirmed to work a bit better and faster (and I
expect that it should, considering that all this data can be passed
through some nested calls multiple times). Hopefully we are not
running in circles.