[Pixman] [PATCH 2/2] sse2, mmx: Remove initial unaligned loops in fetchers
Bill Spitzak
spitzak at gmail.com
Wed Sep 4 12:36:11 PDT 2013
Søren Sandmann wrote:
> Here is another proposal, but I'm not sure it's really better:
>
> - The combiners are made to return a buffer. The returned buffer is
> expected to contain the combined result and may be any of the passed
> src/mask/dest buffers. Almost all combiners will continue to combine
> into the dest buffer, which they will also return. But the SRC
> combiner will simply return the source buffer without any memcpy()ing.
This sounds exactly like how Nuke (compositing software I wrote that is
used in special effects) works. It runs composites of often hundreds of
steps on the CPU quite fast.
Work is done per scanline. The final destination first allocates or
locates a buffer to write the scanline to. It then calls the last step
of the compositing, passing it the buffer. This last step then returns a
pointer to the resulting scanline. This may be in the passed buffer, or
other memory (typically a pointer to a source buffer). As you point out,
if the final destination needs the result in its buffer, it must do a
memcpy when the returned pointer is not the buffer:
    float* buffer = framebuf + y*w;
    float* result = final_step(buffer, w);
    if (result != buffer) memcpy(buffer, result, w * sizeof(*buffer));
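A minimal, self-contained sketch of that destination loop, to make the copy-only-when-needed rule concrete. The names `step_fn`, `fill_step`, `source_step` and `render_rows` are hypothetical and not taken from Nuke or pixman; the step signature is assumed to be the same `(buffer, w)` shape as above, with a `const` result so callers never write through a returned source pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed step signature: return a pointer to w finished pixels,
 * which may or may not be the buffer that was passed in. */
typedef const float *(*step_fn)(float *buffer, int w);

/* Hypothetical step that computes its result into the passed buffer. */
static const float *fill_step(float *buffer, int w) {
    for (int i = 0; i < w; i++)
        buffer[i] = 0.5f;
    return buffer;
}

/* Hypothetical step that already owns its pixels and ignores the buffer. */
static const float source_row[4] = {1.0f, 2.0f, 3.0f, 4.0f};
static const float *source_step(float *buffer, int w) {
    (void)buffer;
    assert(w <= 4);
    return source_row;
}

/* Final destination: one scanline at a time, copying only when the
 * result did not land in the destination row. Returns the number of
 * memcpy calls that were needed. */
static int render_rows(float *framebuf, int w, int h, step_fn final_step) {
    int copies = 0;
    for (int y = 0; y < h; y++) {
        float *buffer = framebuf + (size_t)y * (size_t)w;
        const float *result = final_step(buffer, w);
        if (result != buffer) {
            memcpy(buffer, result, (size_t)w * sizeof *buffer);
            copies++;
        }
    }
    return copies;
}
```

With `fill_step` the loop never copies; with `source_step` it copies once per row, which is the single unavoidable copy for a plain SRC operation.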
The compositing step recursively calls its input steps. In most cases it
passes the output buffer it got on to its input, so the step does not
need to allocate any temporary buffer. The input may write its result
to the output buffer, and the step then overwrites that with its own
calculation. In most cases this requires no actual thought and works
perfectly. Here is a pseudo code version of a step that adds 1 to every
pixel:
    float* add1(float* buffer, int w) {
        float* source = input_step(buffer, w);
        for (int i = 0; i < w; i++)
            buffer[i] = source[i] + 1;
        return buffer;
    }
Note that a no-op in the middle does not need to do a memcpy. Instead it
just returns the buffer returned from the input:
    float* noop(float* buffer, int w) {
        return input_step(buffer, w);
    }
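The two steps above can be wired into a runnable chain roughly as follows. This is only a sketch: the `struct step` layout and the `run_*` names are hypothetical stand-ins for however a real compositor links a step to its input.

```c
#include <assert.h>
#include <stddef.h>

struct step;
typedef const float *(*run_fn)(const struct step *self, float *buffer, int w);

/* Hypothetical step record: a function, an optional input step, and
 * (for sources only) the pixels the step already owns. */
struct step {
    run_fn run;
    const struct step *input;
    const float *pixels;
};

/* Source: hands back its own pixels and never touches the buffer. */
static const float *run_source(const struct step *self, float *buffer, int w) {
    (void)buffer;
    (void)w;
    return self->pixels;
}

/* No-op: forwards whatever pointer its input returned, no memcpy. */
static const float *run_noop(const struct step *self, float *buffer, int w) {
    assert(self->input != NULL);
    return self->input->run(self->input, buffer, w);
}

/* Add 1: reads from wherever the input landed, writes into buffer. */
static const float *run_add1(const struct step *self, float *buffer, int w) {
    assert(self->input != NULL);
    const float *source = self->input->run(self->input, buffer, w);
    for (int i = 0; i < w; i++)
        buffer[i] = source[i] + 1.0f;
    return buffer;
}
```

In a chain source -> noop -> add1, the only pixel writes are the ones in `run_add1`; the no-op costs one function call and the source pixels are left unmodified.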
Sometimes a step needs to use a second buffer. These are allocated on
the stack using alloca (except on Windows, sigh...). The most common
reason is that a step merges the output of two or more input steps:
    float* add_two_inputs(float* buffer, int w) {
        float* a = input_a(buffer, w);
        float temp[w];
        float* b = input_b(temp, w);
        for (int i = 0; i < w; i++)
            buffer[i] = a[i] + b[i];
        return buffer;
    }
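A runnable variant of the two-input case, as a sketch. To sidestep the alloca / variable-length-array portability problem it assumes a fixed upper bound `MAX_W` on the scanline width, which is an assumption of this example and not something the scheme requires. `input_a` and `input_b` here are hypothetical inputs that both write into whatever buffer they are given, which is exactly the situation where the second buffer is needed.

```c
#include <assert.h>

enum { MAX_W = 1024 };  /* assumed maximum scanline width */

/* Hypothetical input A: writes 0, 1, 2, ... into the given buffer. */
static const float *input_a(float *buffer, int w) {
    for (int i = 0; i < w; i++)
        buffer[i] = (float)i;
    return buffer;
}

/* Hypothetical input B: writes 0, 10, 20, ... into the given buffer. */
static const float *input_b(float *buffer, int w) {
    for (int i = 0; i < w; i++)
        buffer[i] = 10.0f * (float)i;
    return buffer;
}

static const float *add_two_inputs(float *buffer, int w) {
    assert(w > 0 && w <= MAX_W);
    float temp[MAX_W];                   /* scratch row on the stack */
    const float *a = input_a(buffer, w); /* may land in buffer */
    const float *b = input_b(temp, w);   /* must not clobber a */
    for (int i = 0; i < w; i++)
        buffer[i] = a[i] + b[i];         /* element i is read before it is written */
    return buffer;
}
```

If `input_b` were also handed `buffer`, it would overwrite the result of `input_a` before the sum is formed; the scratch row is what keeps the two inputs apart.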
I think some steps you might expect to need separate buffers can reuse
the same one:
    int* convert_565_to_888(int* buffer, int w) {
        short* a = input((short*)buffer, w);
        // Note it must go backwards! If a points into buffer, each
        // 32-bit write lands on 16-bit inputs that have already been read.
        for (int i = w; i--; )
            buffer[i] = pixel_565_to_888(a[i]);
        return buffer;
    }
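A self-contained sketch of the in-place widening, for anyone who wants to check the backwards loop. `pixel_565_to_888` is written out here as one plausible definition (RGB565 in, 0x00RRGGBB out, high bits replicated into the low bits); that layout is an assumption of this example. The 16-bit pixels are read with memcpy instead of through a cast pointer so the example stays within C's aliasing rules. `widen_565_in_place` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed conversion: RGB565 -> 0x00RRGGBB with bit replication. */
static uint32_t pixel_565_to_888(uint16_t p) {
    uint32_t r = (uint32_t)(p >> 11) & 0x1F;
    uint32_t g = (uint32_t)(p >> 5) & 0x3F;
    uint32_t b = (uint32_t)p & 0x1F;
    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return (r << 16) | (g << 8) | b;
}

/* The first w * 2 bytes of buffer hold w RGB565 pixels on entry; on
 * return the first w * 4 bytes hold w 32-bit pixels. Output pixel i
 * covers the bytes of input pixels 2i and 2i+1, so walking from the
 * last pixel to the first never destroys an input that is still needed. */
static uint32_t *widen_565_in_place(uint32_t *buffer, int w) {
    const unsigned char *narrow = (const unsigned char *)buffer;
    for (int i = w; i-- > 0; ) {
        uint16_t p;
        memcpy(&p, narrow + (size_t)i * sizeof p, sizeof p);
        buffer[i] = pixel_565_to_888(p);
    }
    return buffer;
}
```

Running the same loop forwards would overwrite input pixels 2 and 3 while writing output pixel 1, before they had been converted.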
In any case this scheme greatly reduces the amount of copying, and
(probably more importantly) the amount of memory allocation and freeing
being done. I would recommend it for cairo.