[Pixman] [PATCH 2/2] sse2, mmx: Remove initial unaligned loops in fetchers

Wed Sep 4 12:36:11 PDT 2013

Søren Sandmann wrote:

> Here is another proposal, but I'm not sure it's really better:
> 
> - The combiners are made to return a buffer. The returned buffer is
>   expected to contain the combined result and may be any of the passed
>   src/mask/dest buffers. Almost all combiners will continue to combine
>   into the dest buffer, which they will also return. But the SRC
>   combiner will simply return the source buffer without any memcpy()ing.

This sounds exactly like how Nuke works (compositing software I wrote 
that is used in special effects). It runs composites of often hundreds 
of steps on the CPU, and does so quite fast.

Work is done per scanline. The final destination first allocates or 
locates a buffer to write the scanline to. It then calls the last step 
of the compositing, passing it the buffer. This last step then returns a 
pointer to the resulting scanline. This may be in the passed buffer, or 
other memory (typically a pointer to a source buffer). As you point out, 
if the destination needs the result in its own buffer, it must do a 
memcpy whenever the returned pointer is not that buffer:

   float* buffer = framebuf + y*w;
   float* result = final_step(buffer, w);
   if (result != buffer) memcpy(buffer, result, w * sizeof(*buffer));
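
To make the destination side concrete, here is a minimal self-contained sketch. The names step_fn, constant_source, ramp and write_scanline are made up for illustration and are not from Nuke, pixman or cairo. The return type is const float* only as a reminder that the returned memory may belong to someone else:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical step signature: a step may fill `buffer` and return it,
   or return a pointer to other memory that already holds the scanline. */
typedef const float *(*step_fn)(float *buffer, int w);

/* Hypothetical source step: the pixels already exist in memory, so it
   returns them directly and never touches `buffer`. */
static const float source_row[4] = { 0.0f, 0.25f, 0.5f, 1.0f };

static const float *constant_source(float *buffer, int w)
{
    (void)buffer;
    (void)w;                       /* sketch only: assumes w <= 4 */
    return source_row;
}

/* Hypothetical generator step: computes into `buffer` and returns it. */
static const float *ramp(float *buffer, int w)
{
    for (int i = 0; i < w; i++)
        buffer[i] = (float)i;
    return buffer;
}

/* Destination: hands its own scanline to the last step and copies only
   if the result came back somewhere else. Returns 1 if it had to copy. */
static int write_scanline(float *framebuf, int y, int w, step_fn final_step)
{
    float *buffer = framebuf + (size_t)y * (size_t)w;
    const float *result = final_step(buffer, w);
    if (result != buffer) {
        memcpy(buffer, result, (size_t)w * sizeof *buffer);
        return 1;
    }
    return 0;
}
```

With ramp as the last step no copy happens at all; with constant_source the single memcpy at the destination is the only copy in the whole chain.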

The compositing step recursively calls its input steps. In most cases 
it passes the output buffer it was given on to its input, so the step 
does not need to allocate any temporary buffer. The input may write its 
result to the output buffer, and the step then overwrites that with its 
own calculation. In most cases this requires no actual thought and works 
perfectly. Here is a pseudo code version of a step that adds 1 to every 
pixel:

     float* add1(float* buffer, int w) {
       float* source = input_step(buffer, w);
       for (int i = 0; i < w; i++)
         buffer[i] = source[i] + 1;
       return buffer;
     }

Note that a no-op in the middle does not need to do a memcpy. Instead it 
just returns the buffer returned from the input:

     float* noop(float* buffer, int w) {
       return input_step(buffer, w);
     }
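
The snippets above leave open how a step finds its input. One possible arrangement, purely as an assumption for illustration, is a small struct per step holding a function pointer and a link to the upstream step. The struct layout and the names source_run, noop_run and add1_run are hypothetical:

```c
#include <stddef.h>

typedef struct step step;
struct step {
    const float *(*run)(const step *self, float *buffer, int w);
    const step *input;     /* upstream step, NULL for a source */
    const float *pixels;   /* scanline owned by a source step  */
};

/* A source already has its pixels in memory: return them, ignore buffer. */
static const float *source_run(const step *self, float *buffer, int w)
{
    (void)buffer;
    (void)w;
    return self->pixels;
}

/* A no-op forwards whatever pointer its input produced; nothing is copied. */
static const float *noop_run(const step *self, float *buffer, int w)
{
    return self->input->run(self->input, buffer, w);
}

/* add1 lets its input use the output buffer, then writes its own result
   there. Reading source[i] before writing buffer[i] keeps this correct
   even when source and buffer are the same memory. */
static const float *add1_run(const step *self, float *buffer, int w)
{
    const float *source = self->input->run(self->input, buffer, w);
    for (int i = 0; i < w; i++)
        buffer[i] = source[i] + 1.0f;
    return buffer;
}
```

Chaining source, no-op and add1 this way, the no-op returns the source's own pointer untouched, and only add1 ever writes to the buffer.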

Sometimes a step needs to use a second buffer. These are allocated on 
the stack, using alloca or a variable-length array as in the example 
below (except on Windows, sigh...). The most common reason is that a 
step merges the output of two or more input steps:

     float* add_two_inputs(float* buffer, int w) {
       float* a = input_a(buffer, w);
       float temp[w];
       float* b = input_b(temp, w);
       for (int i = 0; i < w; i++)
         buffer[i] = a[i] + b[i];
       return buffer;
     }
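
Where neither alloca nor variable-length arrays can be relied on, one possible fallback (my assumption, not necessarily what Nuke does) is a fixed-size stack array that covers typical scanline widths, with malloc only for unusually wide ones. STACK_PIXELS, input_fn, fill_ones and fill_twos are hypothetical names:

```c
#include <stdlib.h>

#define STACK_PIXELS 1024   /* assumed "typical" maximum scanline width */

typedef const float *(*input_fn)(float *buffer, int w);

/* Hypothetical inputs that simply fill the buffer they are given. */
static const float *fill_ones(float *buffer, int w)
{
    for (int i = 0; i < w; i++)
        buffer[i] = 1.0f;
    return buffer;
}

static const float *fill_twos(float *buffer, int w)
{
    for (int i = 0; i < w; i++)
        buffer[i] = 2.0f;
    return buffer;
}

/* Returns buffer on success, NULL if the heap fallback could not allocate. */
static float *add_two_inputs(float *buffer, int w,
                             input_fn input_a, input_fn input_b)
{
    float stack_temp[STACK_PIXELS];
    float *temp = stack_temp;

    if (w > STACK_PIXELS) {             /* rare path: too wide for the stack */
        temp = malloc((size_t)w * sizeof *temp);
        if (temp == NULL)
            return NULL;
    }

    const float *a = input_a(buffer, w);  /* first input may use the output */
    const float *b = input_b(temp, w);    /* second input gets the scratch  */
    for (int i = 0; i < w; i++)
        buffer[i] = a[i] + b[i];

    if (temp != stack_temp)
        free(temp);
    return buffer;
}
```

The common case still does no heap allocation, which is the point of the scheme; the malloc path only exists so that very wide scanlines do not overflow the stack.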

I think some steps that you might expect to need a separate buffer can 
reuse the one they were given:

     int* convert_565_to_888(int* buffer, int w) {
       short* a = input((short*)buffer, w);
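       // note it must go backwards! buffer[i] overwrites a[2*i] and
       // a[2*i+1], which by then have been read (or are a[0] when i == 0)
       for (int i = w; i--; )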
         buffer[i] = pixel_565_to_888(a[i]);
       return buffer;
     }
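
Here is a self-contained sketch of that in-place widening. It assumes the 16-bit pixels sit packed at the start of storage that is big enough for the 32-bit result. The body of pixel_565_to_888 is one common bit-replication expansion that I picked for illustration, and widen_565_in_place is a made-up name. It goes through memcpy on a byte pointer so that reading the same memory as two different pixel types stays well defined in C:

```c
#include <stdint.h>
#include <string.h>

/* Expand r5g6b5 to x8r8g8b8 by replicating the top bits of each channel
   into the low bits, so that full intensity maps to 255. */
static uint32_t pixel_565_to_888(uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1f;
    uint32_t g = (p >> 5) & 0x3f;
    uint32_t b = p & 0x1f;

    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return (r << 16) | (g << 8) | b;
}

/* Widen w packed 16-bit pixels at the start of `storage` into w 32-bit
   pixels in the same storage, which must hold at least w * 4 bytes. */
static void widen_565_in_place(void *storage, int w)
{
    unsigned char *bytes = storage;

    /* Backwards: output i covers bytes 4i..4i+3, while every input not
       yet read lies below byte 2i, so no unread input is overwritten. */
    for (int i = w; i-- > 0; ) {
        uint16_t in;
        uint32_t out;

        memcpy(&in, bytes + (size_t)i * sizeof in, sizeof in);
        out = pixel_565_to_888(in);
        memcpy(bytes + (size_t)i * sizeof out, &out, sizeof out);
    }
}
```

Running the loop forwards would destroy input pixel 1 while writing output pixel 0, which is why the direction matters.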

In any case this scheme greatly reduces the amount of copying, and 
(probably more important) the amount of memory allocation/free being 
done. I would recommend it for cairo.