[Pixman] [PATCH 2/2] ARM: Add 'neon_composite_over_n_8888_0565' fast path

Mon Jun 6 15:36:08 PDT 2011

Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:

> I'm not sure if there is a real need for luajit or some other existing
> JIT engine. The conversion of existing ARM NEON macro assembly to
> runtime code generation should be actually pretty straightforward and
> not so challenging (but still pretty time consuming). And it involves
> a few steps:
>
> 1. The use of NEON instructions indeed can be translated into
> 'emit_some_instruction()' calls. The layers of nested macros can also
> be easily handled via higher level functions, so that
>     .macro blah
>         /* do something */
>     .endm
> can be directly translated into 'emit_blah()' function. The
> correctness of runtime code generation can be verified by comparing
> its results with the output of gnu assembler (they should be
> identical). An immediate benefit here is that some parameters which
> are currently configured at compile time (RESPECT_STRICT_ALIGNMENT and
> PREFETCH_TYPE_DEFAULT), can be actually tuned at runtime in order to
> be better optimized for the detected CPU variant.
>
> 2. Modify 'emit_some_instruction()' functions in such a way that they
> do not immediately generate instructions, but also pass them through a
> basic instruction scheduler and automatic pipeliner which knows about
> NEON instruction latencies. Make it also provide some debugging output
> such as the estimated number of cycles the code takes per pixel and
> how many pipeline stalls it still contains (so that the worst cases
> could be analyzed and improved).
>
> 3. Automatic code generation for a wide range of fast paths by just
> chaining FETCH -> COMBINE -> STORE blocks. This also needs at least a
> basic register allocator in order to smoothly glue the pieces
> together.

FWIW, a simple no-IR JIT compiler is what I wanted to do a while back
for x86, but stopped working on it when I realized that the register
allocator would either become very complex or very poor.
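
To make the no-IR idea from step 1 concrete, here is a rough sketch in C
of how a macro could turn into an 'emit_blah()' function. Everything in
it is an assumption made for illustration: the code_buf_t and
jit_params_t types, the function names, and especially the opcode
values, which are placeholders and not real ARM/NEON encodings. Note
that it hardcodes register 0 everywhere, which sidesteps exactly the
register allocation problem mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical code buffer; a real JIT would write to executable memory. */
typedef struct {
    uint32_t words[64];
    size_t   count;
} code_buf_t;

/* Placeholder opcodes for illustration, NOT real ARM/NEON encodings. */
enum { OP_LOAD = 0x01, OP_MUL = 0x02, OP_STORE = 0x03, OP_PREFETCH = 0x04 };

/* Parameters fixed at build time in the assembly, chosen at runtime here. */
typedef struct {
    int prefetch_enabled;   /* stands in for PREFETCH_TYPE_DEFAULT */
} jit_params_t;

/* Append one fake instruction word; silently drops it if the buffer is full. */
static void emit_word(code_buf_t *buf, uint32_t opcode, uint32_t reg)
{
    if (buf->count < sizeof buf->words / sizeof buf->words[0])
        buf->words[buf->count++] = (opcode << 24) | (reg & 0xff);
}

/* Counterpart of a ".macro fetch_src": one C function per macro. */
static void emit_fetch_src(code_buf_t *buf, const jit_params_t *p, uint32_t reg)
{
    if (p->prefetch_enabled)
        emit_word(buf, OP_PREFETCH, reg);
    emit_word(buf, OP_LOAD, reg);
}

/* A nested macro becomes a function that calls other emit functions.
 * Returns the number of words generated. */
static size_t emit_pixel_block(code_buf_t *buf, const jit_params_t *p)
{
    buf->count = 0;
    emit_fetch_src(buf, p, 0);
    emit_word(buf, OP_MUL, 0);
    emit_word(buf, OP_STORE, 0);
    return buf->count;
}
```

Calling emit_pixel_block() twice with different jit_params_t values
yields two different instruction sequences from the same source, which
is the runtime tuning benefit described in the quoted text.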

> And I specifically don't want any intermediate machine independent
> code here. Just because for example ARM NEON and x86 SSE2 are very
> different, and I expect that targeting them both via some portable
> pseudocode is going to become either inefficient or way too complex.
> So I think that if introducing x86 JIT backend would be also desired,
> it is better to be implemented independently (maybe just sharing some
> really high level parts), and probably primarily targeting Intel Atom
> as it has a very simple and predictable pipeline (should be easy to
> describe for the scheduler), and also it is probably the most
> performance limited relevant x86 processor.

There are some longer-term reasons that intermediate machine independent
code could be useful:

- supporting more complex operations, like CoreImage does
- adding a shader language to pixman and cairo
- targeting GPUs

However, none of that means an ARM-specific JIT compiler that more or
less ran the GNU as code at runtime wouldn't be useful.

>> I think one issue that prevented that from going into pixman proper was
>> that there was no good way to get the computed flags down to the general
>> code path.
>
> What prevented it from going to pixman git master was that the code is
> quite hackish and not clean. And there is little motivation to clean
> it because there are hardly going to be many users for it (only those who
> are familiar with pixman code already). Other than that, it works
> mostly fine and just adds a bit of runtime overhead.

If it was turned on in development releases, then it would become
possible to ask people for this information if they complain about
performance. Also, if pixman could report it even when a fast path is
taken, it could be useful to track software fallbacks from hardware
accelerated drivers.

>> If so, it might be interesting to combine it with this
>> branch:
>>
>>        http://cgit.freedesktop.org/~sandmann/pixman/commit/?h=composite-args
>>
>> in which the composite arguments are passed in a stack allocated struct
>> instead of as function arguments. The computed flags could then be
>> stored in that struct too with only minimal overhead.
>
> Is it an attempt to bring the old FbComposeData struct back to life,
> now rebranded as pixman_composite_info_t?

Yes, more or less, except that this time it would be used in all the
composite routines, not just FbCompositeRect().
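
As a rough illustration of the idea, the sketch below passes the
composite arguments in a stack allocated struct and stores the computed
flags in it, so that both the fast path check and the general path can
read them. The struct layout, flag names, and function names are made
up for this example; they are not the ones used on the composite-args
branch or in pixman itself.

```c
#include <stdint.h>

/* Hypothetical flag bits; pixman's real flags differ. */
#define FLAG_SRC_OPAQUE (1u << 0)
#define FLAG_NO_MASK    (1u << 1)

/* Illustrative struct, not the actual pixman_composite_info_t layout. */
typedef struct {
    int32_t  width, height;
    uint32_t flags;          /* computed once at the entry point */
} composite_info_t;

/* Which code path handled the request (for demonstration only). */
typedef enum { PATH_GENERAL, PATH_GENERAL_OPAQUE, PATH_FAST } path_t;

static int fast_path_applies(const composite_info_t *info)
{
    uint32_t need = FLAG_SRC_OPAQUE | FLAG_NO_MASK;
    return (info->flags & need) == need;
}

/* The general path sees the flags without any extra function arguments. */
static path_t general_path(const composite_info_t *info)
{
    return (info->flags & FLAG_SRC_OPAQUE) ? PATH_GENERAL_OPAQUE
                                           : PATH_GENERAL;
}

static path_t composite(int32_t width, int32_t height,
                        int src_opaque, int has_mask)
{
    composite_info_t info = { 0 };   /* lives on this function's stack */

    info.width = width;
    info.height = height;
    if (src_opaque)
        info.flags |= FLAG_SRC_OPAQUE;
    if (!has_mask)
        info.flags |= FLAG_NO_MASK;

    /* Only one pointer is passed down, however deep the calls nest. */
    return fast_path_applies(&info) ? PATH_FAST : general_path(&info);
}
```

Adding another field to the struct later does not change any function
signature, which is what keeps the overhead of carrying the flags small.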

> I'm actually all for this change if it gets confirmed to work a bit
> better and faster (and I expect that it should, considering that all
> this data can be passed through some nested calls multiple
> times). Hopefully we are not running in circles.

The FbComposeData struct was used in precisely one place to pass
information between two functions. It could not possibly have provided
any benefit as it existed in X server 1.3, which is the version that
eventually became pixman.

I think what might have happened is that full support for FbComposeData
was introduced in one of the several forks that existed around 2005 and
never merged into Xorg, except for that one place.

So I'm not too worried about running in circles, but I don't know if it
is actually a performance advantage. I tend to think that if such
microoptimizations really result in a measurable performance advantage,
then the problem should probably be fixed in some other way.

Soren