[Pixman] Prototype JIT compiler

Fri Jan 17 15:15:24 PST 2014

Hello, 

Over the Christmas holidays, I spent some time writing a prototype JIT
compiler for pixman. Since I may not be able to spend much time working
on it in the near future, I thought I'd write up a bit of information
about it, in case someone else wants to play around with it.

The code is available in this branch:

    http://cgit.freedesktop.org/~sandmann/pixman/commit/?h=jit

Some things about it work quite well I think:

- pixman-jit-x86-asm.[ch]:

These files are a runtime assembler for x86 (both 64-bit and 32-bit). It
supports most things that you expect from an assembler, such as labels
and code alignment. It also correctly selects the best encoding whenever
there is a choice (e.g., for "add rax, $17" it will pick the short
encoding available for the rax register). It also uses short jumps
whenever possible.
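
To illustrate what "short jumps whenever possible" means at the byte
level, here is a minimal sketch. It is not the code from
pixman-jit-x86-asm.c; emit_jmp() is a hypothetical helper, and it
assumes the target offset is already known (as for a backward jump). A
real assembler also has to handle forward jumps, where the choice of
encoding changes the offsets of later code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 *
 * Writes an unconditional jmp at code[pos] that lands on byte offset
 * 'target' of the same buffer, and returns the number of bytes written:
 * 2 for the short form (EB rel8), 5 for the near form (E9 rel32).
 */
size_t
emit_jmp (uint8_t *code, size_t pos, size_t target)
{
    /* x86 jump displacements are relative to the END of the jump. */
    int64_t rel = (int64_t) target - (int64_t) (pos + 2);
    uint32_t u;

    if (rel >= INT8_MIN && rel <= INT8_MAX)
    {
        code[pos]     = 0xEB;                 /* jmp rel8 */
        code[pos + 1] = (uint8_t) rel;
        return 2;
    }

    /* Doesn't fit in 8 bits: recompute for the 5-byte encoding. */
    rel = (int64_t) target - (int64_t) (pos + 5);
    assert (rel >= INT32_MIN && rel <= INT32_MAX);

    u = (uint32_t) rel;
    code[pos]     = 0xE9;                     /* jmp rel32 */
    code[pos + 1] = (uint8_t) (u & 0xff);     /* little-endian rel32 */
    code[pos + 2] = (uint8_t) ((u >> 8) & 0xff);
    code[pos + 3] = (uint8_t) ((u >> 16) & 0xff);
    code[pos + 4] = (uint8_t) ((u >> 24) & 0xff);
    return 5;
}
```

A jump from offset 4 back to offset 0 takes the 2-byte form, while a
jump from offset 0 to offset 1000 needs the 5-byte form.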

At the same time, the code is compact and fast. All the instructions are
described in one big ~40Kb table, and the rest of the code (apart from
the table) is just 1600 lines. It doesn't support every x86 instruction,
but it's easy to add anything that is missing (although AVX-512 with its
optional arguments may require a bit of work to support).

If/when support for other architectures is added, this file would likely
have to be split up in order to share the code for "bookkeeping" (labels
etc.), while allowing different instruction sets.

A missing feature is the ability to free the generated code, and there
are almost certainly bugs related to out-of-memory conditions.

The way to use it is like this:

    fragment1 = assembler_create_fragment (asm);

    BEGIN_ASM (fragment1)
          DEFINE_LABEL ("begin"),
          I_mov,         rax,       IMM (17),
          I_add,         rbx,       rax,
          I_jne,         LABEL ("done"),
          I_jmp,         LABEL ("begin"),
          DEFINE_LABEL ("done"),
          I_ret,
    END_ASM ();

    code = assembler_link (asm, fragment1, fragment2, NULL);

where assembler_link will concatenate the code described in fragment1
and fragment2, resolve labels and jumps, then return a pointer to
executable code.
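
For readers who want a picture of the linking step, the following is a
rough sketch of "concatenate, then resolve labels". It is not how
assembler_link() is implemented: fragment_t, label_ref_t and
link_fragments() are hypothetical names, the fragments are assumed to
hold already-encoded bytes with zeroed rel32 fields, and the output goes
to a plain buffer rather than executable memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_REFS  4
#define MAX_FRAGS 4

/* All types here are hypothetical, for illustration only. */
typedef struct { const char *name; size_t offset; } label_ref_t;

typedef struct
{
    const uint8_t *bytes;             /* encoded code, rel32 fields zeroed */
    size_t         n_bytes;
    label_ref_t    defs[MAX_REFS];    /* labels defined in this fragment */
    size_t         n_defs;
    label_ref_t    fixups[MAX_REFS];  /* offsets of rel32 fields to patch */
    size_t         n_fixups;
} fragment_t;

/* Returns the total code size, or -1 on error. */
long
link_fragments (const fragment_t *frags, size_t n_frags,
                uint8_t *out, size_t out_size)
{
    size_t base[MAX_FRAGS];
    size_t total = 0;
    size_t i, j, f, d;

    assert (frags != NULL && out != NULL);
    if (n_frags > MAX_FRAGS)
        return -1;

    /* Pass 1: concatenate, remembering where each fragment starts. */
    for (i = 0; i < n_frags; i++)
    {
        if (frags[i].n_bytes > out_size - total)
            return -1;
        base[i] = total;
        memcpy (out + total, frags[i].bytes, frags[i].n_bytes);
        total += frags[i].n_bytes;
    }

    /* Pass 2: patch every rel32 field with (label - end of field). */
    for (i = 0; i < n_frags; i++)
    {
        for (f = 0; f < frags[i].n_fixups; f++)
        {
            size_t field = base[i] + frags[i].fixups[f].offset;
            long target = -1;
            uint32_t rel;

            for (j = 0; j < n_frags; j++)
                for (d = 0; d < frags[j].n_defs; d++)
                    if (strcmp (frags[j].defs[d].name,
                                frags[i].fixups[f].name) == 0)
                        target = (long) (base[j] + frags[j].defs[d].offset);

            if (target < 0 || field + 4 > total)
                return -1;            /* undefined label or bad fixup */

            rel = (uint32_t) (target - (long) (field + 4));
            out[field + 0] = (uint8_t) (rel & 0xff);
            out[field + 1] = (uint8_t) ((rel >> 8) & 0xff);
            out[field + 2] = (uint8_t) ((rel >> 16) & 0xff);
            out[field + 3] = (uint8_t) ((rel >> 24) & 0xff);
        }
    }

    return (long) total;
}
```

Note that a label defined in one fragment can be the target of a jump in
another, which is the point of linking several fragments in one call.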

- pixman-jit-code-manager.[ch]:

These files handle memory management for executable memory, i.e., mapping
files and marking them writable and executable as needed. When SELinux
and other security features are involved, you can't just malloc() some
memory and execute it.

An interesting potential feature would be to make the allocated files
ELF files so that they would show up in profilers. The main missing
piece here is the ability to free allocated memory.

(And of course there is no support for Windows or anything else
non-Linux).
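
As a rough idea of what file-backed executable memory can look like on
Linux, here is a minimal sketch. It is one possible approach and not
necessarily what pixman-jit-code-manager.c does: it maps the same
temporary file twice, once writable and once executable, so that no
single mapping is both. exec_region_t, the function names and the /tmp
path are all assumptions made for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical type, for illustration only. */
typedef struct
{
    void  *write_ptr;   /* read/write view: code is stored through this */
    void  *exec_ptr;    /* read/execute view of the same file pages */
    size_t size;
} exec_region_t;

/* Returns 0 on success, -1 on failure (for example when /tmp is mounted
 * noexec or a security policy refuses the executable mapping). */
int
exec_region_create (exec_region_t *region, size_t size)
{
    char path[] = "/tmp/jit-sketch-XXXXXX";   /* assumed location */
    int fd;

    assert (region != NULL && size > 0);

    fd = mkstemp (path);
    if (fd < 0)
        return -1;
    unlink (path);      /* the open fd keeps the file alive */

    if (ftruncate (fd, (off_t) size) != 0)
    {
        close (fd);
        return -1;
    }

    region->write_ptr = mmap (NULL, size, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    region->exec_ptr  = mmap (NULL, size, PROT_READ | PROT_EXEC,
                              MAP_SHARED, fd, 0);
    close (fd);         /* the mappings keep the file alive from here on */

    if (region->write_ptr == MAP_FAILED || region->exec_ptr == MAP_FAILED)
    {
        if (region->write_ptr != MAP_FAILED)
            munmap (region->write_ptr, size);
        if (region->exec_ptr != MAP_FAILED)
            munmap (region->exec_ptr, size);
        return -1;
    }

    region->size = size;
    return 0;
}

void
exec_region_destroy (exec_region_t *region)
{
    assert (region != NULL);
    munmap (region->write_ptr, region->size);
    munmap (region->exec_ptr, region->size);
}
```

Bytes stored through write_ptr become visible at the same offset through
exec_ptr, which is the address a caller would jump to.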

- New pixman infrastructure to deal with jitted compositing functions

This turned out to be a pretty simple extension of the 'fast path'
mechanism already in place. Instead of exposing a table of fast paths,
implementations now expose a pair of functions, 'create_composite' and
'destroy_composite', that are passed the flags supported by the images
in question. By default, create_composite() simply scans the fast path
table looking for a match, and destroy_composite() is a no-op.

When a fast path is evicted from the cache, destroy_composite() is
called. A jitting implementation can then simply have its
create_composite() jit a compositing function and destroy it in
destroy_composite(). Caching is then handled by the existing
mechanism. (It may be useful to expand the fast path cache if jitting
turns out to be very expensive).
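
The interaction between the cache and the two hooks can be sketched as
follows. This is an illustration only and not pixman's code: the types,
the two-entry move-to-front cache, and the reduction of "flags" to two
integers are all assumptions, and demo_create()/demo_destroy() are
stand-ins that merely count calls where a real implementation would
generate and release code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SIZE 2    /* kept tiny so that eviction is easy to see */

/* All names below are hypothetical, for illustration only. */
typedef void (*composite_func_t) (void);

typedef struct implementation implementation_t;
struct implementation
{
    composite_func_t (*create_composite) (implementation_t *imp,
                                          uint32_t src_flags,
                                          uint32_t dest_flags);
    void (*destroy_composite) (implementation_t *imp, composite_func_t func);
};

typedef struct
{
    uint32_t         src_flags, dest_flags;
    composite_func_t func;
} cache_entry_t;

typedef struct
{
    implementation_t *imp;
    cache_entry_t     entries[CACHE_SIZE];
    size_t            n;
} cache_t;

composite_func_t
cache_lookup (cache_t *c, uint32_t src_flags, uint32_t dest_flags)
{
    cache_entry_t e;
    size_t i;

    assert (c != NULL && c->imp != NULL);

    for (i = 0; i < c->n; i++)
    {
        if (c->entries[i].src_flags == src_flags &&
            c->entries[i].dest_flags == dest_flags)
        {
            /* Hit: move the entry to the front. */
            e = c->entries[i];
            memmove (&c->entries[1], &c->entries[0], i * sizeof (e));
            c->entries[0] = e;
            return e.func;
        }
    }

    /* Miss: ask the implementation for a function (it may jit one). */
    e.src_flags = src_flags;
    e.dest_flags = dest_flags;
    e.func = c->imp->create_composite (c->imp, src_flags, dest_flags);
    if (e.func == NULL)
        return NULL;

    /* Evict the least recently used entry if the cache is full. */
    if (c->n == CACHE_SIZE)
    {
        c->n--;
        c->imp->destroy_composite (c->imp, c->entries[c->n].func);
    }

    memmove (&c->entries[1], &c->entries[0], c->n * sizeof (e));
    c->entries[0] = e;
    c->n++;
    return e.func;
}

/* Stand-in "jitting" implementation that only counts calls. */
int n_created, n_destroyed;

void demo_composite (void) { }

composite_func_t
demo_create (implementation_t *imp, uint32_t src_flags, uint32_t dest_flags)
{
    (void) imp; (void) src_flags; (void) dest_flags;
    n_created++;
    return demo_composite;
}

void
demo_destroy (implementation_t *imp, composite_func_t func)
{
    (void) imp; (void) func;
    n_destroyed++;
}
```

With this shape, the jitting implementation never has to track lifetimes
itself: every function it creates is eventually handed back to
destroy_composite() when the cache drops it.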

The less convincing part is pixman-x86-jit.[ch], which is a jitting
implementation for x86-64 that can generate various blitter-type
compositing functions. This means it isn't all that useful right now
because pixman-sse2.c has those covered pretty well.

Also, while the code generated isn't totally awful, it isn't really
great either. In particular, it is built on the assumption that
source/mask/destination will be converted to a8r8g8b8/a8/a8r8g8b8 and
then combined. However, in many cases, this isn't optimal:

- For a8b8g8r8 OVER a8b8g8r8 it is clearly counterproductive to convert
  both source and destination to a8r8g8b8, and then convert back to
  a8b8g8r8.

- For solid sources we want to do all the source swizzling outside of the
  main loop, and also extract the alpha channel into its own
  register. There is no support for this currently.

Another less-convincing bit is the register allocator, which is as dumb
as possible. All it does is keep track of available registers and
hand them out as needed. There is no spilling, and if it runs out of
registers, it will simply abort().

This is sufficient for generating blitters on x86-64 (provided the
aborting is turned into giving up), where we have 14 general purpose
registers, but won't be good enough for x86-32, nor will it be good
enough if more complicated sources than regular untransformed images are
added.
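
To make "as dumb as possible" concrete, an allocator of that kind can be
as small as the sketch below. This is not the code in pixman-x86-jit.c;
the names are hypothetical, it returns -1 instead of calling abort()
(the "giving up" variant mentioned above), and the mask assumes that the
14 usable registers are all of the 16 except rsp and rbp.

```c
#include <assert.h>
#include <stdint.h>

/* Hardware numbering: rax=0, rcx=1, rdx=2, rbx=3, rsp=4, rbp=5, rsi=6,
 * rdi=7, r8..r15=8..15. Excluding rsp and rbp is an assumption. */
#define X86_64_ALLOCATABLE (0xFFFFu & ~((1u << 4) | (1u << 5)))

/* Hypothetical type, for illustration only. */
typedef struct
{
    uint32_t free_mask;     /* bit r set: register r is available */
} reg_alloc_t;

void
reg_alloc_init (reg_alloc_t *ra, uint32_t available)
{
    ra->free_mask = available;
}

/* Hands out the lowest-numbered free register, or -1 if none is left.
 * There is no spilling: the caller has to give up on -1. */
int
reg_alloc_get (reg_alloc_t *ra)
{
    int r;

    for (r = 0; r < 32; r++)
    {
        if (ra->free_mask & (1u << r))
        {
            ra->free_mask &= ~(1u << r);
            return r;
        }
    }

    return -1;
}

void
reg_alloc_put (reg_alloc_t *ra, int reg)
{
    assert (reg >= 0 && reg < 32);
    assert (!(ra->free_mask & (1u << reg)));    /* catch double release */
    ra->free_mask |= 1u << reg;
}
```

On x86-32 the same mask would have far fewer bits set, which is where
the lack of spilling starts to hurt.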

There is also no support for dealing with constants. When there are
enough registers, we want to allocate constants in registers; when there
aren't, we want to put them ideally in one shared constant pool, and
otherwise on the stack or directly embedded in the code (in the case
of x86-64, where we have RIP-relative addressing).

Right now, certain specific constants are just permanently allocated in
registers, but this is not ideal.

Finally, there are a large number of boring but straightforward tasks,
such as dealing with OOM and fixing memory leaks. The patch series
would of course also need to be cleaned up.

Anyway, if anyone is interested in this, I'll be happy to answer
questions either here on the mailing list or on IRC, but as mentioned I
may not be able to do a lot of work on it for a while.

Søren