[Pixman] RFC: Pixman benchmark CPU time measurement

Tue Jun 2 22:38:25 PDT 2015

On Tue, 2 Jun 2015 10:32:35 +0300
Pekka Paalanen <ppaalanen at gmail.com> wrote:

> Hi,
> 
> most pixman performance benchmarks currently rely on gettime() from
> test/util.[ch]:
> - lowlevel-blt-bench
> - prng-test
> - radial-perf-test
> - scaling-bench
> 
> Furthermore, affine-bench has its own gettimei() which is essentially
> gettime() but with uin32_t instead of double.
> 
> double
> gettime (void)
> {
> #ifdef HAVE_GETTIMEOFDAY
>     struct timeval tv;
> 
>     gettimeofday (&tv, NULL);
>     return (double)((int64_t)tv.tv_sec * 1000000 + tv.tv_usec) / 1000000.;
> #else
>     return (double)clock() / (double)CLOCKS_PER_SEC;
> #endif
> }
> 
> This definition of gettime() has several potential drawbacks:
> 
> 1. clock() will wrap around often, the manual page warns that in some
>    cases it wraps around every 72 minutes. As the code in Pixman never
>    expects a wraparound, this is subtly broken. This is a fallback path
>    for systems that do not provide gettimeofday(), so it is rarely used
>    if at all.

Yes, the clock() fallback is there just to fill the void and make this
code compile on the systems, which don't have gettimeofday().

It also measures the CPU time instead of wall-clock time, but this
can't be helped.

> 2. gettimeofday() measures wall-clock time, which might not be the best
>    to measure code performance on a CPU, because all other load in the
>    system will affect the result. It's probably not a significant
>    problem on fast systems where you know to run your benchmarks
>    uncontended.

In my opinion, wall-clock time is exactly what we need to measure.

If you have something else running concurrently in the system, the
benchmark results are already screwed. Relying on the OS scheduler
to correctly account the CPU time still does not compensate, for
example, the cache thrashing effects.

Basically, with the wall-clock time measurements it is easier to
notice that something is wrong if the benchmark is not run uncontended.
It happened to me a few times to accidentally run benchmarks with the
compilation running in the background. Fortunately, the results were
very much off and this was sufficient to raise a suspicion :-)

Maybe trying to measure both wall-clock and CPU time at the same time
and comparing the results in the end would be an even better safety
guard?

> 3. gettimeofday() is not only subject to NTP adjustments but is also
>    affected by setting the system clock. IOW, this is not a guaranteed
>    monotonic clock. Again, unlikely to be a problem in most cases, as
>    benchmarks run long enough to even out NTP skews, but short enough to
>    avoid accidentally hitting clock changes. (Or so I would assume.)

Yes, this is unfortunate. But this is hardly the worst problem.

> 4. Using double to store an absolute timestamp is suspicious to me. In
>    the end, you always compute the difference between two timestamps,
>    and using a floating point type may cause catastrophic cancellation
>    [1] depending on absolute values. However, [1] also explains that a
>    double is enough. But, given that we read an arbitrary system clock
>    whose starting point is unknown (ok, Epoch for the moment), we might
>    already be getting values too large to maintain the expected
>    accuracy (for floats, sure; for doubles, who knows?)

The double type for storing the time in seconds (since Epoch) was
selected on purpose.

First, there is no accuracy problem. The time in seconds since
Epoch is going to fit in a 32-bit value at least until 2038. The
mantissa in double has 53 bits. Which leaves us with 53 - 32 = 21 bits
for the fractional part. As such, the granularity of the time
representation can be estimated as 1 / 2^21 = ~0.12 nanoseconds.
Which is roughly equivalent to one clock cycle of a hypothetical
8 GHz processor. The calculations may be off by one here or there,
but we are interested in the order of magnitude. This is more than
enough.

Second, the floating point representation is much better for handling
timestamps. One reason is that we avoid overflow problems (the
intermediate calculations may potentially have unsafe multiplications
by the CPU clock frequency, etc.). And if we want to calculate things
like the standard deviation, then the floating point representation is
also much more convenient. Another reason is that the printf style
functions work nicely with doubles, but have some portability
inconveniences with 64-bit integers.

This was the reason why obtaining the time (in a potentially system
dependent way) was abstracted in the gettime() function and the rest
of the code only works with convenient doubles.

Thanks for mentioning the new gettimei() function. Was there a
justified reason to introduce it instead of using gettime()? 

> I would propose the following:
> 
> - runtime clock selection with this priority order:
> 	1. clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
> 	2. getrusage(RUSAGE_SELF) -> rusage.ru_utime (user time)
> 	3. gettimeofday()
> 	4. clock()
>   Naturally with build time checks, too. For 3 and 4 would print a
>   warning about inaccurate measurements. clock_gettime(CLOCK_MONOTONIC)
>   is not in the list because I would assume getrusage() is more widely
>   available and I'd like to use process time before wall-clock delta.
> 
> - A separate void init_gettime(void) for choosing the clock.
> 
> - void gettime(struct timespec *ts) for reading the clock.
> 
> - double elapsed(const struct timespec *begin, const struct timespec
>   *end) for getting the elapsed time in seconds.

This API looks excessively complicated to me.

> In my experiments on the Raspberry Pi 1 [2], it seems to me that
> clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is the most accurate CPU time
> measurement, getrusage() being possibly rounded to HZ ticks. If neither
> is available, I'll just forget about accuracy and get whatever we can
> get, essentially what the current gettime() does but without the
> floating point difference and the wraparound accounted for (at least
> avoid negative elapsed time, which might break algorithms).
> 
> What would you think about this scheme?
> 
> I am not making a promise to implement this any time soon, I would just
> like to hear a few opinions whether it is worth even considering.
> 
> This is also like a public note for myself, in case I need to think
> about these again. My difficulties with benchmarking Pixman on the Rpi1
> is what prompted this, but this wouldn't solve most of the problems I
> have there.

I'm afraid that focusing on the time measurement accuracy can't
provide any real practical improvements. Even if it is measured
very accurately (and I believe that we are already doing it good
enough), the actual running time is non-deterministic and depends
on many factors in the modern multitasking operating systems.

One major problem at least on ARM devices is the fragmentation of
physical memory pages. If you allocate a memory buffer, which is
contiguous in the virtual address space, this does not mean that
this buffer is also contiguous in the physical memory. Moreover,
we have no idea how to get and take into account this information
from the userspace. The actual layout of the buffers in the
physical memory does affect performance because the CPU data caches
have limited associativity and the DRAM is organized in banks.
Using huge pages can mitigate the issue, but this feature is not
in a very good shape on ARM (huge pages are only supported in LPAE
mode, but LPAE is not supported on older processors and uses 3-level
page tables instead of 2-level page tables, which makes it somewhat
slower). Still we can introduce huge pages support in the pixman
benchmark programs. It will help on certain devices with certain
kernels.

Also the USB hardware is not great in the Raspberry Pi and it was
generating tons of interrupts at least some time ago. This parasitic
activity can interfere with the running tasks and make the benchmark
results differ across multiple runs. You can watch the counters
in /proc/interrupts to see if this is still the case.

Changing one part of the code and recompiling the library may change
relative alignments of many unrelated code parts. And the CPU
instruction cache also has limited associativity.

Measuring small differences is going to be always problematic because
of the less than perfect signal/noise ratio.

> Thanks,
> pq
> 
> [1] https://randomascii.wordpress.com/2012/02/13/dont-store-that-in-a-float/
> 
> [2] the test program:
> https://git.collabora.com/cgit/user/pq/pixman-benchmarking.git/tree/src/timings.c?id=795db042f44a12147de7449f47da901670733f71
> was running over a weekend, generating 78k samples (a big web page!):
> https://git.collabora.com/cgit/user/pq/pixman-benchmarking.git/commit/?id=795db042f44a12147de7449f47da901670733f71

-- 
Best regards,
Siarhei Siamashka