[Pixman] [PATCH 1/4] vmx: optimize scaled_nearest_scanline_vmx_8888_8888_OVER

Thu Sep 17 08:58:13 PDT 2015

On Sun, 6 Sep 2015 10:08:06 -0700
Matt Turner <mattst88 at gmail.com> wrote:

> On Sun, Sep 6, 2015 at 8:27 AM, Oded Gabbay <oded.gabbay at gmail.com> wrote:
> > reference memcpy speed = 24847.3MB/s (6211.8MP/s for 32bpp fills)
> >
> >                 Before          After           Change
> >               --------------------------------------------
> > L1              182.05          210.22         +15.47%
> > L2              180.6           208.92         +15.68%
> > M               180.52          208.22         +15.34%
> 
> There's no variation between L1, L2, and M -- as a follow on, it might
> be interesting to experiment with unrolling the loop a bit. Looked
> like the other patches in this series show the same behavior.

In fact it might be a defect in lowlevel-blt-bench itself. The benchmark
picks arbitrary sizes of the working sets and only labels them as 'L1',
'L2' and 'M'. The 'M' benchmark works with two 1920x1080 32bpp buffers,
which means that only ~16MB of memory is touched at most. Modern
processors already have a last level cache of comparable size, and
https://en.wikipedia.org/wiki/POWER8 mentions 16MB of L4 cache.

On the other hand, ~200 MPix/s is relatively low bandwidth by
high performance desktop/server standards and should not be a problem
for the automatic hardware prefetch.

-- 
Best regards,
Siarhei Siamashka