[Pixman] disable cache prefetch on ATOM can improve the gtkperf performance

Wed Jun 23 00:59:45 PDT 2010

On Wed, 9 Jun 2010, Soeren Sandmann wrote:

> Soeren Sandmann <sandmann at daimi.au.dk> writes:
> 
> > prefetching. On an AMD Phenom with 512 KB of L2 and 512 KB of L3,
> > disabling prefetch was a tiny but consistent slow-down.

I ran the traces on an Intel Atom N450 with no L3, 512 KB of L2 and a 
whopping 32 KB of L1 (a decent size for Intel compared to P4s) and got 
the following perf-diff:

old: atom-with-noprefetch
new: atom-with-prefetch
Slowdowns
=========
image-rgba          firefox-world-map-0    40705.97 (40781.80 0.11%) -> 42901.58 (42964.03 0.06%):  1.05x slowdown
image-rgba         swfdec-giant-steps-0    7008.48 (7041.00 0.31%) -> 7392.33 (7414.32 0.25%):  1.05x slowdown
image-rgba                    poppler-0    7514.99 (7585.73 0.42%) -> 7928.72 (8025.66 0.40%):  1.06x slowdown
image-rgba          xfce4-terminal-a1-0    9758.85 (9759.44 0.02%) -> 10312.88 (10320.35 0.03%):  1.06x slowdown
image-rgba             firefox-woodtv-0    4197.62 (4198.42 0.07%) -> 4490.93 (4497.23 0.08%):  1.07x slowdown
image-rgba         gnome-terminal-vim-0    13625.58 (13650.19 0.09%) -> 14588.68 (14600.53 0.04%):  1.07x slowdown
image-rgba                  ocitysmap-0    5137.56 (5146.48 0.29%) -> 5772.16 (5784.54 0.17%):  1.12x slowdown
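
For anyone checking the factors by hand: the sketch below assumes the
slowdown figure is simply the ratio of the first number on each side of
the arrow. That is my reading of the output format, not something the
tool states above, and slowdown_factor is a made-up helper name.

```c
/* Ratio of the new time to the old time; values above 1.0 mean the
 * new build is slower.  Assumes both times use the same unit. */
double
slowdown_factor (double old_time, double new_time)
{
    return new_time / old_time;
}
```

For firefox-world-map-0 this gives 42901.58 / 40705.97, which rounds to
the 1.05x reported above.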

Cooked and raw numbers are available here for the interested:

	http://people.freedesktop.org/~joonas/tmp/atom/

With a lower threshold for cairo-perf-diff you'll see that there's a
small slowdown from using prefetching nearly across the board.  On the
whole, my experience with prefetching for graphics is that doing it
manually hasn't been very useful, mostly because the streaming access
logic on most memory controllers seems to work pretty well.  I've
never seen this big a difference from actively using prefetching,
but then again this is the first time I've run tests on real app code
when looking at it.
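
To make "manual prefetching" concrete, here is a minimal sketch of the
kind of loop being compared. It is not pixman's code: copy_pixels,
PIXELS_PER_LINE and PREFETCH_AHEAD are hypothetical names, the 64-byte
line size and the prefetch distance are assumptions, and
__builtin_prefetch is the GCC/Clang builtin rather than whatever the
real fast paths use.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed: 64-byte cache lines, i.e. 16 pixels of 32 bits each. */
#define PIXELS_PER_LINE 16
/* Assumed prefetch distance in pixels; would need tuning per machine. */
#define PREFETCH_AHEAD 64

/* Copy n 32-bit pixels from src to dst.  When use_prefetch is nonzero,
 * issue a read prefetch hint once every 16 pixels for data that lies
 * PREFETCH_AHEAD pixels further on. */
void
copy_pixels (uint32_t *dst, const uint32_t *src, size_t n, int use_prefetch)
{
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (use_prefetch &&
            (i % PIXELS_PER_LINE) == 0 &&
            i + PREFETCH_AHEAD < n)
        {
            /* rw = 0: prefetch for reading; locality = 0: streaming
             * data that need not stay in cache afterwards. */
            __builtin_prefetch (&src[i + PREFETCH_AHEAD], 0, 0);
        }

        dst[i] = src[i];
    }
}
```

Both settings of use_prefetch produce the same pixels; only the timing
can differ, which is what the perf-diff is measuring.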

In my experience a far bigger impact can be seen from the kind of
memory move instruction you use to access the data, which caches the
data turns out to be in, and how many streams you're trying to access
concurrently.  For instance, using a non-temporal move in the wrong
place can really wreak havoc with performance, so it's a bad idea to
use one in generic code paths IMO.
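
As a rough illustration of the two kinds of move, the sketch below
fills a buffer with either ordinary or non-temporal SSE2 stores.
fill_pixels is a hypothetical helper, not a pixman function, and it
falls back to plain scalar stores when SSE2 isn't available at compile
time. The intrinsics are the standard ones from emmintrin.h.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Fill n 32-bit pixels at dst with the value pixel.  When non_temporal
 * is nonzero and SSE2 is available, the aligned body uses streaming
 * stores that bypass the cache; otherwise it uses ordinary stores. */
void
fill_pixels (uint32_t *dst, size_t n, uint32_t pixel, int non_temporal)
{
    size_t i = 0;

#if defined(__SSE2__)
    __m128i v = _mm_set1_epi32 ((int) pixel);

    /* Scalar stores until dst + i is 16-byte aligned. */
    while (i < n && ((uintptr_t) (dst + i) & 15) != 0)
        dst[i++] = pixel;

    if (non_temporal)
    {
        /* movntdq: written data does not end up in the cache. */
        for (; i + 4 <= n; i += 4)
            _mm_stream_si128 ((__m128i *) (dst + i), v);

        /* Order the streaming stores before any later stores. */
        _mm_sfence ();
    }
    else
    {
        /* movdqa: written data stays in the cache hierarchy. */
        for (; i + 4 <= n; i += 4)
            _mm_store_si128 ((__m128i *) (dst + i), v);
    }
#else
    (void) non_temporal;
#endif

    /* Scalar tail, or the whole buffer when SSE2 is not available. */
    for (; i < n; i++)
        dst[i] = pixel;
}
```

The result in memory is identical either way. The difference is that
after the streaming variant the destination is not in cache, so if the
next operation reads those pixels back it has to go to memory for them,
which is the sort of case where the wrong choice hurts.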

Cheers,

Joonas