[Pixman] [PATCH 05/14] ARMv6: Force fast paths to have fixed alignment to the BTAC

Siarhei Siamashka siarhei.siamashka at gmail.com
Mon Oct 14 19:39:27 PDT 2013

On Wed,  2 Oct 2013 00:00:25 +0100
Ben Avison <bavison at riscosopen.org> wrote:

> Trying to produce repeatable, trustworthy profiling results from the
> cairo-perf-trace benchmark suite has proved tricky, especially when testing
> changes that have only a marginal (< ~5%) effect upon the runtime as a whole.
> One of the problems is that some traces appear to show statistically
> significant changes even when the only fast path that has changed is not even
> exercised by the trace in question. This patch helps to address this by
> ensuring that the aliasing between the branch predictor's target address cache
> (BTAC) entries for the remaining fast paths is not affected by the addition,
> removal or refactoring of any other fast paths.

Just curious: is this BTAC explanation just speculation, or was it
actually confirmed by something like monitoring the hardware
performance counters?

The processor in the Raspberry Pi has only 16 KB of L1 instruction
cache, which is 4-way set associative. The L2 cache is slow, which
makes L1 misses very expensive. It could be that collisions within
sets in the L1 instruction cache play a significant role, and that
changes in the relative locations of the "hot" parts of the code
between recompilations distort the results.

If it's the BTAC, then the branch prediction miss rate is going to be
affected. If it's collisions when allocating lines in the instruction
cache, then the I-cache miss rate is going to be affected.
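The aliasing mechanism itself is easy to model. Here is a toy sketch of a
direct-mapped branch target cache; the 128-entry size, the index bits and
the branch offsets are all invented assumptions for illustration, not the
real ARM11 organization:

```python
# Toy model of a direct-mapped branch target address cache (BTAC).
# Assumed (hypothetical) parameters: 128 entries, indexed by
# instruction-address bits [8:2], i.e. the word address modulo 512 bytes.

ENTRIES = 128

def btac_index(addr):
    """Map a branch instruction address to a BTAC entry."""
    return (addr >> 2) & (ENTRIES - 1)

# Hot branches at fixed offsets from a fast path's entry point
# (the offsets are made up for the example).
branch_offsets = [0x10, 0x48, 0x90]

def indices(func_base):
    return [btac_index(func_base + off) for off in branch_offsets]

# Unaligned build: adding or removing code elsewhere shifts the function
# start, so the hot branches land on different BTAC entries per build.
print(indices(0x8124) != indices(0x8124 + 0x20))   # True: aliasing changed

# With a 512-byte alignment the index bits depend only on each branch's
# offset within the function, not on how much code precedes it.
print(indices(0x8000) == indices(0x8000 + 0x200))  # True: aliasing stable
```

Under this (simplified) model, forcing every fast path to a 512-byte
boundary makes the predictor aliasing a function of the code itself rather
than of its link-time position.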

Branch prediction can also be disabled by patching the kernel for
the sake of the experiment. Either disable branch prediction completely:

diff --git a/arch/arm/mm/proc-v6.S b/arch/arm/mm/proc-v6.S
index ae1cb16..c3cd3d5 100644
--- a/arch/arm/mm/proc-v6.S
+++ b/arch/arm/mm/proc-v6.S
@@ -259,7 +259,7 @@ __v6_setup:
 	.type	v6_crval, #object
-	crval	clear=0x01e0fb7f, mmuset=0x00c0387d, ucset=0x00c0187c
+	crval	clear=0x01e0fb7f, mmuset=0x00c0307d, ucset=0x00c0187c

Or disable only the dynamic branch prediction (so that static
branch prediction still works):

diff --git a/arch/arm/mm/proc-v6.S b/arch/arm/mm/proc-v6.S
index ae1cb16..ac30e73 100644
--- a/arch/arm/mm/proc-v6.S
+++ b/arch/arm/mm/proc-v6.S
@@ -249,6 +249,11 @@ __v6_setup:
 	mcreq	p15, 0, r5, c1, c0, 1		@ write aux control reg
 	orreq	r0, r0, #(1 << 21)		@ low interrupt latency configuration
+	mrc	p15, 0, r5, c1, c0, 1		@ load aux control reg
+	bic	r5, r5, #(1 << 1)		@ disable dynamic branch prediction
+	mcr	p15, 0, r5, c1, c0, 1		@ write aux control reg
 	mov	pc, lr				@ return to head.S:__ret

Documentation about the control registers and their bits can be
found here (the Z bit controls branch prediction and the DB bit
controls dynamic branch prediction):


There are also many other interesting configuration knobs to play with.

Regarding results reproducibility: on different ARM hardware I have
also observed major fluctuations in cairo-perf-trace results between
multiple runs when the memory controller was apparently heavily taxed
by the 1920x1080 60Hz 32bpp framebuffer scanout. Enabling huge pages
support improved the cairo-perf-trace score and made the results
reproducible. I have applied the hugetlb patches to the Raspberry Pi
kernel for running some simple tests/benchmarks, and these patches
can still be found in the following git branch:


This is only usable via hugectl (no transparent huge pages), but
cairo-perf-trace runs via hugectl just fine. The transparent huge
pages support can be backported to the Raspberry Pi kernel too, if
anybody finds it useful.
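For a sense of scale, here is a back-of-the-envelope sketch of the TLB
reach involved, using the framebuffer size quoted above as a stand-in for
a working set of that order; the 1 MB huge page size is an assumption
(section-sized mappings), not a confirmed detail of those patches:

```python
# Rough TLB-reach arithmetic for a working set the size of the
# 1920x1080 32bpp framebuffer mentioned above (~8 MB).
working_set = 1920 * 1080 * 4            # bytes (8294400)
small_page  = 4 * 1024                   # 4 KB base pages
huge_page   = 1024 * 1024                # 1 MB sections (assumed huge page size)

# Ceiling division: TLB entries needed to map the whole working set.
small_entries = -(-working_set // small_page)
huge_entries  = -(-working_set // huge_page)

print(small_entries)  # 2025 entries with 4 KB pages
print(huge_entries)   # 8 entries with 1 MB huge pages
```

A working set needing thousands of TLB entries on a small ARM11 TLB
thrashes it constantly; with huge pages a handful of entries suffice,
which is consistent with both the score improvement and the reduced noise.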

> The profiling results later in this patch series have been calculated with
> this switch enabled, to ensure fair comparisons. Additionally, the
> cairo-perf-trace test harness itself was modified to do timing using
> getrusage() so as to exclude any kernel mode components of the runtime.
> Between these two measures, the majority of false positives appear to have
> been eliminated; the remaining ones tend to fall below 2% change, so any
> such measurements have been excluded from the reports.
> ---
>  pixman/pixman-arm-simd-asm.S |    3 +++
>  pixman/pixman-arm-simd-asm.h |    9 +++++++++
>  2 files changed, 12 insertions(+)
> diff --git a/pixman/pixman-arm-simd-asm.S b/pixman/pixman-arm-simd-asm.S
> index c209688..259fb88 100644
> --- a/pixman/pixman-arm-simd-asm.S
> +++ b/pixman/pixman-arm-simd-asm.S
> @@ -611,3 +611,6 @@ generate_composite_function \
>  /******************************************************************************/
> +#ifdef PROFILING

I just wonder whether it perhaps makes sense to expose this build
configuration as a pixman configure option and document it?

There is already an "--enable-static-testprogs" option, which allows
building the pixman tests as static binaries. This is useful for
cross-compilation and running the tests in QEMU, or for uploading and
running them on real hardware in a non-glibc environment such as
Android.

> +.p2align 9

Was this particular alignment selected experimentally, by trying to
minimize the measurement noise, or was it chosen arbitrarily and then
verified to be good?
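For reference, `.p2align 9` in GNU as pads the location counter to the
next 2^9 = 512-byte boundary. The rounding it performs can be sketched as
follows (the addresses are invented):

```python
# .p2align N in GNU as aligns the location counter to a 2**N-byte
# boundary by inserting padding. For N = 9 that is 2**9 = 512 bytes.
def p2align(addr, n):
    """Round addr up to the next multiple of 2**n."""
    size = 1 << n
    return (addr + size - 1) & ~(size - 1)

print(p2align(0x8124, 9) == 0x8200)  # True: padded up to the next boundary
print(p2align(0x8200, 9) == 0x8200)  # True: already aligned, no padding
```

So with this directive every fast path entry point sits at a multiple of
512 bytes, at the cost of up to 511 padding bytes per function.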

> +#endif
> diff --git a/pixman/pixman-arm-simd-asm.h b/pixman/pixman-arm-simd-asm.h
> index 4c08b9e..c7e5ca7 100644
> --- a/pixman/pixman-arm-simd-asm.h
> +++ b/pixman/pixman-arm-simd-asm.h
> @@ -54,6 +54,12 @@
>   */
>  /*
> + * Determine whether we space out fast paths to reduce the effect of
> + * different BTAC aliasing upon comparative profiling results
> + */
> +#define PROFILING
> +
> +/*
>   * Determine whether we put the arguments on the stack for debugging.
>   */
>  #undef DEBUG_PARAMS
> @@ -590,6 +596,9 @@
>                                     process_tail, \
>                                     process_inner_loop
> +#ifdef PROFILING
> + .p2align 9
> +#endif
>   .func fname
>   .global fname
>   /* For ELF format also set function visibility to hidden */

Best regards,
Siarhei Siamashka
