[Pixman] [PATCH 06/12] vmx: implement fast path vmx_composite_over_n_8888_8888_ca

Thu Jul 16 02:39:15 PDT 2015

On Wed, 15 Jul 2015 16:06:14 +0300
Pekka Paalanen <ppaalanen at gmail.com> wrote:

> On Wed, 15 Jul 2015 15:41:19 +0300
> Oded Gabbay <oded.gabbay at gmail.com> wrote:
> 
> > >> > Or if we don't care about that, why?
> > >> I think that the speedups in this specific patch are more substantial
> > >> than the slowdowns. If it was the other way around, then I would have
> > >> removed this patch, like I did with another patch, which Siarhei
> > >> rejected because of it.
> > >
> > > But in theory, you should not get any slowdowns, right? Or did you
> > > actually expect that some things will slow down?
> > I'm not so sure I won't get any slowdowns. I guess it depends on the
> > size of the image and the amount of alignment that needs to be done
> > for that image.
> > 
> > e.g. if we have many small images, and for each image we need to do
> > unaligned operations first to make sure we are 16-bytes aligned, then
> > the unaligned operations may take more cycles than the cycles that are
> > saved from doing the vmx operations. There could be extreme cases,
> > where there is one vmx operation on aligned data and all the rest of
> > the operations are unaligned. Now, in the C implementation, you don't
> > care about misalignment, so you always work in 4-byte quantities.
> > 
> > Does that make sense ?
> 
> Yeah, this kind of speculation is what I'm after. A plausible and
> acceptable reason why some things may slow down.
> 
> I suppose I'm too new here to understand that without pointing it out.
> 
> However, I think lowlevel-blt-bench's tests should be hitting some of
> those ugly tiny image cases, yet even the worst case there has +22%
> performance. Of course, it's possible that Cairo benchmarks happen to
> hit the ugly tiny images consistently badly, while llbb aims to cover
> all kinds of alignment equally.
> 
> It's all up to judgement, but I think one does need to at least ask the
> question "This is slightly odd... is something actually wrong?"
> 
> Ben and I definitely did find something very strange about the
> Raspberry Pi 1 CPU.

The explanation is actually rather simple. The functions
composite_over_n_8888_8888_ca and composite_over_n_8888_8888
are working with text glyphs and text glyphs typically contain a lot
of fully transparent pixels. The C fast path code typically contains
branches that skip over such pixels. For example, have a look
at (not exactly these particular functions, but it explains the idea):
    http://cgit.freedesktop.org/pixman/commit/?id=2bc59006d7fe91abf68a2061ad86c06e1b2964ab
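
To make the idea concrete, here is a minimal scalar sketch of such a
branch. It is not the code from the commit linked above and not
pixman's actual fast path; the names over_n_8888_ca_row and mul_un8
are hypothetical, and the blend is a simplified per-channel
component-alpha OVER that assumes a premultiplied solid source:

```c
#include <stddef.h>
#include <stdint.h>

/* Rounded (a * b) / 255 for 8-bit values (hypothetical helper). */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/*
 * Simplified component-alpha OVER of a solid source through a mask.
 * Returns the number of pixels skipped by the early-out branch.
 */
static size_t over_n_8888_ca_row(uint32_t src, const uint32_t *mask,
                                 uint32_t *dst, size_t width)
{
    uint8_t srca = (uint8_t)(src >> 24);
    size_t skipped = 0;

    for (size_t i = 0; i < width; i++) {
        uint32_t m = mask[i];

        /* Fully transparent mask pixel: dst is unchanged, skip the math. */
        if (m == 0) {
            skipped++;
            continue;
        }

        uint32_t d = dst[i];
        uint32_t out = 0;

        /* Per channel: d = s * m + d * (1 - srca * m) */
        for (int shift = 0; shift < 32; shift += 8) {
            uint8_t mc = (uint8_t)(m >> shift);
            uint8_t sc = (uint8_t)(src >> shift);
            uint8_t dc = (uint8_t)(d >> shift);
            uint32_t v = mul_un8(sc, mc)
                       + mul_un8(dc, (uint8_t)(255 - mul_un8(srca, mc)));

            if (v > 255)
                v = 255;
            out |= v << shift;
        }
        dst[i] = out;
    }
    return skipped;
}
```

With glyph-like masks most iterations take the early-out, so the scalar
loop does very little work per pixel. A SIMD loop that processes four
pixels at a time can only skip when all four mask pixels are zero.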

The lowlevel-blt-bench does not simulate a realistic distribution
of opaque/transparent/translucent pixels and their relative placement.
And in fact, this distribution may be different for different use cases
and cairo traces.

The costs/benefits of branches skipping over transparent pixels can be
different on different platforms and microarchitectures. There are no
universal rules.

The loop setup overhead also plays a role when the images are small.
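
As a rough illustration of that overhead, the sketch below splits one
scanline into an unaligned head, a 16-byte aligned vector body and a
scalar tail. The names row_split and split_row are hypothetical, and
it assumes 4-byte pixels, a 4-byte aligned destination and four pixels
per vector operation:

```c
#include <stddef.h>
#include <stdint.h>

/* Pixel counts for each phase of one scanline (hypothetical type). */
struct row_split {
    size_t head; /* scalar pixels before the first 16-byte boundary */
    size_t body; /* pixels handled four at a time by vector code */
    size_t tail; /* scalar pixels left over at the end */
};

/* Assumes dst_addr is 4-byte aligned and each pixel is 4 bytes. */
static struct row_split split_row(uintptr_t dst_addr, size_t width)
{
    struct row_split s = { 0, 0, 0 };

    /* Pixels needed to reach the next 16-byte boundary (0..3). */
    size_t head = (size_t)((16 - (dst_addr & 15)) & 15) / 4;

    if (head > width)
        head = width;
    s.head = head;
    width -= head;

    s.body = width & ~(size_t)3; /* whole groups of four pixels */
    s.tail = width & 3;          /* remainder after the last group */
    return s;
}
```

Under these assumptions, a 3-pixel row starting 4 bytes past a
boundary is handled entirely by the scalar head, so the vector code
never runs and only its setup cost remains.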

-- 
Best regards,
Siarhei Siamashka